On Building Blocks for Distributed Systems 

by 
Roberto De Prisco 

Dottorato in Applied Mathematics and Computer Science 
University of Naples, Italy (1998) 

M.S. in Electrical Engineering and Computer Science 
Massachusetts Institute of Technology (1997) 

Laurea in Computer Science 
University of Salerno, Italy (1991) 

Submitted to the Department of Electrical Engineering and Computer Science 
in partial fulfillment of the requirements for the degree of 

Doctor of Philosophy in Electrical Engineering and Computer Science 

at the 

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 

December 1999 

© Massachusetts Institute of Technology. All rights reserved. 



Author 



Department of Electrical Engineering and Computer Science 

December 10, 1999 



Certified by



Prof. Nancy Lynch 
NEC Professor of Software Science and Engineering 

Thesis Supervisor 



Accepted by 



Prof. Arthur C. Smith 
Chairman, Department Committee on Graduate Theses 



On Building Blocks for Distributed Systems 

by 
Roberto De Prisco 



Submitted to the Department of Electrical Engineering and Computer Science 

on December 10, 1999, in partial fulfillment of the 

requirements for the degree of 

Doctor of Philosophy in Electrical Engineering and Computer Science 

Abstract 

In this thesis we have investigated two building blocks for distributed systems: group communication 
services and distributed consensus services. 

Using group communication services is a successful approach in developing fault tolerant dis- 
tributed applications. Such services provide communication tools that greatly facilitate the devel- 
opment of applications. Though many existing systems are used in real world applications, there 
is still a need to provide formal specifications for the group communication services offered by
these systems. Great efforts are being made by many researchers to provide such specifications. In 
this thesis we have tackled this problem and have provided specifications for group communication 
services. One of our specifications considers the notion of primary view; another one generalizes 
this notion to that of primary configurations (views with quorums). Both specifications are shown 
to be implementable. The usefulness of both specifications is demonstrated by applications running 
on top of them. Our specifications are tailored to dynamic systems, where processes join and leave 
the system even permanently. We also showed how the approach used to develop the specifications 
can be applied to transform known algorithms, designed for static settings, in order to make them
adaptable to dynamic systems. 

Distributed consensus is the abstraction of many coordination problems, which are of fundamen- 
tal importance in distributed systems. Distributed consensus has been thoroughly studied and one 
important result showed that it is not possible to solve consensus in asynchronous systems if failures 
are allowed. However, in such systems it is possible to solve the k-set consensus problem, which is
a relaxed version of the consensus problem: each participating process begins the protocol with an 
input value and by the end of the protocol it must decide on one value so that at most k total values 
are decided by all correct processes (the classical consensus problem requires that there be a unique 
value decided by all correct processes). In this thesis we have investigated the k-set consensus problem
in asynchronous distributed systems. We extended previous work by exploring several variations
of the problem definition and model, including for the first time investigation of Byzantine failures. 
We showed that the precise definition of the validity requirement, which characterizes what decision 
values are allowed as a function of the input values and whether failures occur, is crucial to the 
solvability of the problem. We introduced six validity conditions for this problem (all considered in 
various contexts in the literature), and we demarcated the line between possible and impossible for 
each case. In many cases this line is different from the one of the originally studied k-set consensus
problem. 

Chapter 1



Introduction 



In the last decade the impact of distributed systems on computing has been tremendous. Nowadays 
no single workstation is stand-alone; even personal computers in homes are connected with the rest 
of the computing world by means of the Internet. Computer interconnections can be classified at 
various levels, depending on the kind of interaction required by the components that are connected. 
The connection can be as simple as a cable connecting two computers and as complex as the Internet 
which literally connects millions of computers around the world. The more distributed the system,
the more complex the interconnection. Because of this, distributed systems are more subject to
failures than stand-alone computers. Also, distributed systems are harder to program because of the 
difficulties deriving from sharing data, sharing resources, and coordinating work. Thus developing 
distributed applications is a complex task. 

A popular and successful approach to managing this complexity is to decompose the system 
design, by constructing the system from pre-defined communication, synchronization, and memory 
building blocks. These building blocks may represent global (that is, system-wide) or local services;
they may be combined in parallel or may represent different levels of abstraction. The structure 
they provide makes the systems easier to build, to use, and to modify. Examples of such building 
blocks already in use are various types of group membership and group communication services; 
failure detection, leader election, consensus, and atomic commit services; resource allocation and 
synchronization services; and various forms of strongly consistent and weakly consistent shared 
memory. 

In this thesis we investigate two such building blocks, namely, group communication services and 
distributed consensus services. 



1.1 Group communication services 

1.1.1 Overview and related work 

Recently, view-oriented group communication services (see [1] for a survey) have been of particular 
interest. Such a service allows application processes located at different nodes of a distributed 
network to operate collectively as a group, using the service to multicast messages to all members 
of the group. The group communication service is based on a group membership service, which 
provides each group member with a view of the group, including a list of the processes that are group 
members. Messages sent by a process in a view are delivered only to processes that are members of 
that view, and only when they have that view. Within each view, the service offers guarantees about 
the order and reliability of message delivery. Thus, a view-oriented group communication service 
manages both consistent delivery of messages within each view and the reconfiguration involved in 
changing views. Examples of such services are found in the Isis [19], Transis [6, 33], Ensemble [49], 
Newtop [38] and Relacs [14] systems as well as in other systems. Typical applications that use 
view-oriented group communication services include state-machine replication (e.g., [46, 45, 57]), 
distributed transactions and database replication (e.g., [85, 48, 56]), system management (e.g., [4]) 
and monitoring (e.g., [3]), video-on-demand servers (e.g., [10, 9]), collaborative computing, such as, 
distance learning, audio and video conferences, application sharing and even distributed musical 
"jam sessions" (see [87] for more references). 

In order to be most useful, group communication services (as well as other building blocks) require
clear and precise specifications of their guaranteed behavior. Such specifications would allow
application programmers to think carefully about the behavior of systems that use the primitives, 
without having to understand how the primitives themselves are implemented. Unfortunately, pro- 
viding appropriate specifications for group communication services is not an easy task. Some of 
these services are rather complicated, and there is still no agreement about exactly what the guar- 
antees should be. Different specifications arise from different implementations of the same service, 
because of differences in the safety, performance, or fault-tolerance that is provided. Moreover, the 
specifications that most accurately describe particular implementations may not be the ones that 
are easiest for application programmers to use. 

Hence providing group communication specifications is a serious challenge, and requires theo- 
retical work to support the system development work. Such work has included formal specification 
of global membership and communication services (e.g., [17, 23, 25, 27, 34, 72, 81, 83]), design and 
analysis of distributed algorithms that implement or exploit such services (e.g., [57, 7, 88]), and even 
impossibility results (e.g., [23]). 

The Isis system introduced the important concept of virtual synchrony [19]. This concept has 
been interpreted in various ways, but an essential requirement is that if a particular message is 
delivered to several processes, then all have the same view of the membership when the message is
delivered. This allows the recipients to take coordinated action based on the message, the member- 
ship set and the rules prescribed by the application. 

The Isis system was designed for an environment where processors might fail and messages might 
be lost, but where the network does not partition. This assumption might be reasonable for some 
local area networks, but it is not valid in wide area networks. Therefore, the more recent systems 
mentioned above allow the possibility that concurrent views of the group might be disjoint. 

The first major work on the development of specifications for fault-tolerant group-oriented mem- 
bership and communication services appears to be that of Ricciardi [81], and the research area is 
still active (see, e.g., [75, 23, 88, 41]). In particular, there has been a large amount of work on devel- 
oping specifications for partitionable group services. Some specifications deal just with membership 
and views [54, 86] while others also cover message services (ordering and reliability properties) 
[72, 16, 15, 26, 34, 44, 53]. These specifications are all complicated, many are difficult to under- 
stand, and some seem to be ambiguous. It is not clear how to tell whether a specification is sufficient 
for a given application. It is not even clear how to tell whether a specification is implementable at 
all; impossibility results such as those in [23] demonstrate that this is a significant issue. Vitenberg 
et al. [87] provide a comprehensive study and comparison of several existing group communication 
specifications. 

Among previous work the specification of a group communication service provided in [41] is 
particularly relevant to the work done in this thesis. The group communication service specification 
provided in [41], called VS, captures what seems to be the basic property of a view-oriented group
communication service: processes are provided with views of the system and communication is view- 
oriented, meaning that messages sent in a particular view are delivered only within that view. We 
remark that this is not the only property of existing systems, but it is the most important one. The 
strength of the VS specification lies in its simplicity. Yet the specification is powerful enough to build 
applications on top of it. 

Providing specifications that are simple enough to be implementable and strong enough to be 
usable in applications is the key for designing good building blocks. Previous work has shown that 
this is not an easy task: some specifications are too strong to be implementable, e.g. [23], and 
some of them fail to capture the non-triviality of existing group communication services. The VS 
specification has been proven to be implementable and useful as a building block for more powerful
services. Lesley and Fekete [63] have proved that a version of an algorithm of Cristian and Schmuck
[27] implements the VS service. Khazan et al. [58, 59] have used VS in the design of a load-balancing
database algorithm, and in [41] a totally ordered broadcast service is built on top of VS.

We have mentioned that real distributed systems can partition. When a partition occurs the main 
problem to be faced is that of maintaining consistency of replicated data. To cope with partitions, 
usually the application processes perform significant computations only when they have a special
type of view called a primary view. For example, a replicated database application might only 
perform a read or write operation within a primary view, in order to ensure that each read receives 
the result of the last preceding write, in some consistent order of the operations. In this setting, a 
primary view is typically defined to be one whose membership comprises a majority of the universe 
of processes. The intersection property guaranteed by majorities permits information flow from any 
previous primary to a newly formed one. 

This thesis focuses on primary view group communication services because many real applications 
do need to maintain consistency of replicated data. However, applications that can tolerate some 
degree of inconsistency can use partitionable group communication services. For example, in a 
shared white-board application, a partition would result in users seeing only whatever is written by 
users in their component. When (possibly) the partition is recovered the white-board can display 
information written from each component (maybe with some criteria to merge the different white- 
boards of different components). Another example is a distributed booking system for airline tickets. 
If the system partitions into two (or more) components each of them could still accept reservations, 
provided that the airline is willing to face over-booking. 

In distributed applications involving replicated data, a well known way to enhance the availability 
and efficiency of the system is to use quorums. A quorum system is a set of subsets of the members 
of the system which satisfy the property that any two sets intersect. We refer to a view with a 
quorum system defined over the members of the view as a configuration. Using configurations an 
update can be performed with only a quorum available, while with an ordinary view all of the 
members must be available. The intersection property of quorums permits one to maintain data 
consistency, within a given configuration. Quorum systems have been extensively studied and used 
in applications, e.g. [2, 37, 47, 50, 79, 70, 8, 74]. The use of quorums has been proven effective also 
against Byzantine failures [68, 69]. 
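
The intersection property can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the thesis; the names is_quorum_system and majority_quorums are hypothetical, and majorities are just one assumed choice of quorum system. A configuration is modeled as a pair made of the view membership and its quorums.

```python
from itertools import combinations

def is_quorum_system(quorums):
    """Return True if every two quorums share at least one member."""
    return all(a & b for a, b in combinations(quorums, 2))

def majority_quorums(members):
    """Return every subset of `members` that holds a strict majority.

    Majorities are an assumed example; any family of pairwise
    intersecting subsets would qualify as a quorum system.
    """
    ordered = sorted(members)
    smallest = len(ordered) // 2 + 1
    return [frozenset(group)
            for size in range(smallest, len(ordered) + 1)
            for group in combinations(ordered, size)]

# A configuration: the members of a view plus a quorum system over them.
view_members = {"p1", "p2", "p3"}
configuration = (frozenset(view_members), majority_quorums(view_members))
```

With three members, any two of them form a quorum, so an update that reaches two processes is visible to every later quorum even if the third process is unavailable.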

Pre-defined quorum sets can yield efficient implementations in settings which are relatively static, 
i.e., failures are transient. However they work less well in settings where processes routinely join and 
leave the system, or where the system can suffer multiple partitions. For such a setting, a dynamic 
notion of primary is needed. A dynamic notion of primary still needs to maintain some kind of 
intersection property, in order to permit enough information flow between successive primary views 
to achieve coherence. For example, each primary view might have to contain at least a majority of 
the processes in the previous primary view. Several dynamic voting schemes have been developed 
to define primaries adaptively, e.g. [28, 36, 55, 88, 77]. 
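
The difference between the static and the dynamic notion of primary can be sketched as follows. This is a simplified illustration under the example rule stated above (a new primary must contain a majority of the previous primary); the function names are hypothetical, and the sketch assumes that the previous primary is known unambiguously, which a real distributed implementation cannot take for granted.

```python
def is_static_primary(view, universe):
    """Static rule: the view contains a majority of the fixed universe."""
    return 2 * len(view & universe) > len(universe)

def can_succeed(new_view, previous_primary):
    """Dynamic rule (example from the text): the new view contains
    a majority of the members of the previous primary view."""
    return 2 * len(new_view & previous_primary) > len(previous_primary)

universe = frozenset({"a", "b", "c", "d", "e"})
first_primary = frozenset({"a", "b", "c"})   # a majority of the universe
second_primary = frozenset({"a", "b"})       # a majority of first_primary only
```

Here second_primary is not a majority of the universe, yet it may become primary under the dynamic rule because it holds a majority of first_primary. A view such as {"d", "e"} may not, since it shares no member with first_primary.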

In particular, Yeger Lotem, Keidar, and Dolev [88] have described an implementation of a group 
membership service that yields only primary views, according to a dynamic notion of primary. An 
interesting feature of their work is that it points out various subtleties of implementing such a 
membership service in a distributed manner - subtleties involving different opinions by different
processes about what is the previous primary view. These difficulties have led to errors in some of 
the past work on dynamic voting. The algorithm of [88] copes with these subtleties by maintaining 
information about a collection of primary views that "might be" the previous primary view. The 
service deals with group membership only, and not with communication. 

1.1.2 Work in this thesis 

In this thesis we have provided group communication specifications which handle primary views 
and primary configurations; the latter required extending the notion of primary view to that of 
primary configuration. We have proved that the specifications are implementable, by exhibiting
algorithms that implement them, and useful as building blocks for more powerful services, by providing
algorithms that implement these more powerful services on top of the group communication
services.

Dynamic views 

We have provided a group communication service, called DVS, that integrates the VS group communication
service with a dynamic primary view membership service, yielding a dynamic primary
view group communication service. The DVS service is inspired by the implementation of [88], but 
integrates communication with the group membership service. 

An important feature of the DVS specification is the careful handling of the interface between
the service and the application. When a new view starts, applications generally require some pre- 
processing, typically, an exchange of information, to prepare for ordinary computation. For example, 
processes in a coherent database application may need to exchange information about previous 
updates in order to bring everyone in the new view up to date. We expect each application process 
to "register" a new view v when it has completed this pre-processing for view v. The DVS service uses
registration information when it creates a new view v, in order to determine which previously-created 
views must satisfy the intersection property with respect to v. When all members have registered v, 
the application has gathered all information it needs from previous views, and the service no longer 
needs to ensure intersection in membership between views before v and any subsequent ones that 
are formed. 
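
One way to read the role of registration is sketched below. This is an assumption-laden illustration, not the DVS automaton itself: the names View, may_create and registered are hypothetical, view identifiers are assumed to be totally ordered integers, and new views are assumed to be created with increasing identifiers. A new view must intersect every earlier view unless some view lying strictly between the two has been registered by all of its members.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    id: int
    members: frozenset

def may_create(new, created, registered):
    """Check the intersection requirement for a candidate view.

    `created` is the collection of views created so far; `registered`
    maps a view id to the set of members that have registered that view.
    """
    def totally_registered(view):
        return registered.get(view.id, set()) >= view.members

    for old in created:
        if old.id >= new.id:          # assumption: identifiers only grow
            return False
        shielded = any(totally_registered(mid) and old.id < mid.id < new.id
                       for mid in created)
        if not shielded and not (old.members & new.members):
            return False
    return True

v1 = View(1, frozenset({"a", "b", "c"}))
v2 = View(2, frozenset({"c", "d"}))
v3 = View(3, frozenset({"d", "e"}))
```

If both members of v2 have registered it, v3 may be created even though it shares no member with v1, because everything the application needed from v1 has already been gathered in v2. If only one member has registered v2, v3 is rejected.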

Another feature of the DVS specification, compared to that of Yeger Lotem et al. [88], is that
our specification is given as an automaton, which maintains state information about the views 
and the messages sent in each view. This global state can be used in invariants and abstraction 
functions, leading to assertional proofs of the correctness of implementations of DVS, and also of
applications built over DVS. In contrast, Yeger Lotem et al. use a specification given in terms of
the whole sequence of events in an execution, and therefore must use operational reasoning about
complex sequences of events. Extensive experience with proofs of distributed algorithms suggests 
that assertional techniques are less error-prone; also they are more amenable to automated checking. 

We have demonstrated the value of the DVS specification by showing both how it can be implemented
and how it can be used in an application. Both pieces are shown formally, with assertional
proofs. 

The implementation is a variant of the group membership algorithm of [88]. We have proved 
that this algorithm implements DVS, in the sense of trace inclusion, that is, the external behavior of
the implementation is allowed by the DVS specification. The proof uses a (single-valued) simulation
relation and invariant assertions. The key to the proof is an invariant expressing a strong condition 
about nonempty intersections of views; the proof of this depends on relating a local check of majority 
intersection with known views to a global check of nonempty intersection with existing views. 

We have also provided an application algorithm that is a variant of an algorithm in [57, 7, 41], 
modified to use DVS instead of a static view-oriented service. The modified algorithm uses the
registration capability to tell the DVS service that information has been successfully exchanged at
the beginning of a new view. We show that it implements a (non-group-oriented) totally-ordered- 
broadcast service. This proof also uses a simulation relation and invariant assertions. 

We have designed our DVS specification to express the guarantees that we think are useful in
verifying correctness of applications that use the service. Among previous work, two different sorts 
of specifications for a primary group service are notable. Work by Ricciardi and others [83] is 
expressed in terms of temporal logic on consistent cuts; the idea of their specification is that on any 
cut, there are no disjoint sets of processes such that each set is collectively aware of no members 
outside that set. Yeger Lotem et al. [88] use a property of an execution, which was previously defined 
by Cristian [26] for majority groups: any two primary views are linked by a chain of views where 
every consecutive pair of views includes a process that "knows" it belongs to both views. As far 
as we know, these previous specifications have not been used to verify properties of applications 
running above them. 

The DVS specification omits some properties of existing dynamic primary view management 
algorithms. For example, Isis [19] guarantees that processes that move together from one view to 
the next receive exactly the same messages in the first view. Guaranteeing this property requires 
state exchange within the view management service. This property is not needed to verify properties 
of other applications, such as the totally-ordered broadcast service of [41]. Also, our service provides 
no explicit support for application-level state exchange. Real systems, e.g. Isis, do provide such 
support, by allowing application-level state exchange messages to be piggy-backed on the lower-level
state exchange messages. 

Dynamic configurations

Quorum-based methods for managing replicated data are popular because they provide availability 
of both reads and writes in the presence of faulty behavior by some sites or communication links. A 
quorum system is also called a configuration. If a system lasts for a very long time, it may become 
necessary to alter the configuration, perhaps because some sites have failed permanently and others 
have joined the system, or perhaps because users want a different trade-off between read-availability 
and write-availability. For example, if more sites join the system, these sites must be included in 
the quorums in order to use them; if many sites fail permanently, these sites must be taken out
of the quorums in order to maintain availability. The most common proposal has been to use a 
two-phase commit protocol which stops all application operations while all sites are notified of the 
new configuration. Since two-phase commit is a blocking protocol, this solution is vulnerable to a 
single failure during the configuration change. An alternative proposal in [66] has reconfiguration 
directed by a single site, so it is not fault-tolerant either. In a setting of database transactions,
[47] showed how to integrate fault-tolerant updates of replicated information about quorum sizes
(using the same quorums for both data item replicas and quorum information replicas).

Herlihy [51] provides algorithms to shrink and enlarge quorums within a static universe of proces- 
sors; the setting considered in [51] does not allow processors to join and leave the system. Lamport 
discusses how to modify his PAXOS algorithm [61] in order to allow processors to join and leave the
system. In this thesis we integrate these aspects in a group communication framework. 

There are subtle issues that arise in managing the change of configurations, including how to 
make sure that any operation using the new configuration is aware of all information from operations 
that used an old configuration, and how to allow concurrent attempts to alter the configuration. 

In this thesis we have addressed this problem by extending dynamic primary views group commu- 
nication services to handle configurations. The main difficulty in combining configurations with the 
notion of dynamic primary view is the intersection property required to maintain consistency among 
data stored at different sites. A dynamic primary view must intersect the previous one in at least 
a quorum of processors (this property is required, for example, by replicated data applications in 
order to keep all the replicas consistent). With configurations, this intersection property, which works
for primary views, is no longer enough. Indeed, updated information might be only at a quorum and
the processors in the intersection might not be in that quorum. A stronger intersection property is
required. We have proposed one possible intersection property that allows applications to keep data 
consistency across changes of primary configurations. Namely, we require that there be a quorum of 
the old primary configuration which is included in the membership set of the new primary configu- 
ration. This guarantees that there is at least one process in the new primary configuration that has 
the most up to date information. This, similarly to the intersection property of dynamic primary 
views, allows flow of information from the old configuration to the new one and thus permits one to 
preserve data consistency.
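
A small example shows why a nonempty intersection is too weak once updates reach only a quorum, and why containing a whole old quorum suffices. The sketch is illustrative only; the function names are hypothetical, and majority quorums over five processes are an assumed example.

```python
from itertools import combinations

def intersects(new_members, old_members):
    """Weak property: the new membership shares some process with the old one."""
    return bool(new_members & old_members)

def contains_quorum(new_members, old_quorums):
    """Stronger property: some quorum of the old configuration lies
    entirely inside the new membership."""
    return any(q <= new_members for q in old_quorums)

old_members = frozenset("abcde")
# Assumed quorum system: all three-process majorities of the old members.
old_quorums = [frozenset(c) for c in combinations(sorted(old_members), 3)]

written_to = frozenset("abc")     # the latest update reached this quorum only
weak_new = frozenset("def")       # overlaps old members, but misses a, b and c
strong_new = frozenset("cdef")    # contains the old quorum {c, d, e}
```

The membership weak_new intersects the old configuration, yet none of its processes holds the latest update. The membership strong_new contains an old quorum, and since any two old quorums intersect, at least one of its processes (here c) must hold the update.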

We actually considered a more specialized version of configurations which uses two sets of quo- 
rums, a set of read quorums and a set of write quorums, with the property that any read quorum 
intersects any write quorum. (This choice is justified by the application we develop, an atomic 
read/write register.) With this kind of configuration the intersection property that we require for a 
new primary configuration is that there be one read quorum and one write quorum both of which 
are included in the membership set of the new primary configuration. The use of read and write 
quorums (as opposed to just quorums) can be more efficient in order to balance the load of the 
system (e.g., [37]). 
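
The read/write variant can be sketched in the same style. The Configuration class and the function names below are hypothetical, and the particular quorums are an assumed example chosen so that every read quorum intersects every write quorum.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Configuration:
    members: frozenset
    read_quorums: tuple    # tuple of frozensets
    write_quorums: tuple   # tuple of frozensets

def is_well_formed(config):
    """Quorums lie inside the membership and every read quorum
    intersects every write quorum."""
    inside = all(q <= config.members
                 for q in config.read_quorums + config.write_quorums)
    crossing = all(r & w
                   for r in config.read_quorums
                   for w in config.write_quorums)
    return inside and crossing

def may_follow(new_members, old):
    """The new membership must include one read quorum and one write
    quorum of the old primary configuration."""
    has_read = any(r <= new_members for r in old.read_quorums)
    has_write = any(w <= new_members for w in old.write_quorums)
    return has_read and has_write

old = Configuration(
    members=frozenset("abcd"),
    read_quorums=(frozenset("ab"), frozenset("cd")),
    write_quorums=(frozenset("ac"), frozenset("ad"),
                   frozenset("bc"), frozenset("bd")),
)
```

For instance, a new membership {a, b, c, e} includes the read quorum {a, b} and the write quorum {a, c}, so it may follow old. The membership {a, b, e} includes a read quorum but no write quorum, so it may not.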

The resulting dynamic primary configuration group communication service is called DC. This 
service also integrates support for state exchange into the DC specification. This improves the 
modularity of the building block. When a new configuration starts, applications generally require 
some pre-processing, such as an exchange of information, to prepare for ordinary computation. 
Typically this is needed in order to bring every member of the configuration up to date. For 
example, processes in a coherent database application may need to exchange information about 
previous updates in order to bring everyone in the new configuration up to date. We will refer to 
the up-to-date state of a new configuration as the starting state of that configuration. The starting 
state is the state of the computation that all members should have in order to perform regular 
computation. The computation of the starting state should be offered by the communication service 
so that applications do not have to worry about the details of the underlying state exchange. We 
have demonstrated the value of the DC specification by showing both an algorithm that implements
DC and how DC can be used in an application. The implementation is based on a variant of the 
group membership algorithm of [88]. The application is an atomic read/write shared register, and
is similar to the work of Attiya, Bar-Noy and Dolev [12] and of Lynch and Shvartsman [66].

Dynamic algorithms 

We have investigated the use of the technique introduced to design the DVS and DC services to
transform services and applications that are designed for "static" settings, into ones that work well 
in "dynamic" settings. 

We used a variant of the DC service to provide a dynamic version of Lamport's PAXOS
algorithm [61]. The PAXOS algorithm solves a fundamental problem of distributed computing: the 
consensus problem. In such a problem processors of a distributed system start computation with 
an input value and have to make an irreversible decision guaranteeing agreement, which requires 
that all decisions are the same, and validity, which requires that any decision is equal to some input 
value. 
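
The two safety conditions can be stated as simple checks over a finished run. The sketch below is illustrative and its names are hypothetical; inputs and decisions are modeled as dictionaries from process names to values. The parameter k is an addition for illustration: k = 1 gives classical agreement, and larger k gives the relaxed "at most k decided values" condition of k-set consensus described in the abstract.

```python
def satisfies_agreement(decisions, k=1):
    """At most k distinct values are decided (k = 1: all decisions equal)."""
    return len(set(decisions.values())) <= k

def satisfies_validity(inputs, decisions):
    """Every decided value is the input value of some process."""
    return set(decisions.values()) <= set(inputs.values())

inputs = {"p1": 0, "p2": 1, "p3": 1}
decisions = {"p1": 1, "p2": 1, "p3": 1}
```

In this run all processes decide 1, which is an input of p2 and p3, so both conditions hold. A run in which p1 decides 0 and p2 decides 1 violates agreement for k = 1 but would be acceptable for k = 2.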

The PAXOS algorithm tolerates many types of failures: timing failures, loss, duplication and 
reordering of messages and process stopping failures. Process recoveries are considered; some stable
storage is needed. PAXOS is guaranteed to work safely, that is, to satisfy agreement and validity,
regardless of process, channel and timing failures and process recoveries. When the distributed 
system stabilizes, meaning that there are no failures or process recoveries and a majority of the
processes are not stopped, for a sufficiently long time, termination is achieved; the performance of 
the algorithm when the system stabilizes is good. 

The original algorithm is designed for static settings, where failures are transient, that is, failed 
processors recover. If a majority (or a quorum) of the processors is not available the system is 
blocked. If such a majority or quorum permanently leaves the system, then the system is blocked 
forever. The variant we have designed adapts well to permanent changes of the underlying distributed 
system. 

The PAXOS algorithm bears many similarities with an earlier algorithm of Liskov and Oki [76]. 
The work of Liskov and Oki uses a notion of "view" which changes when a new primary site needs to 
be selected. The notions of "view" and "view synchrony" have since proven very successful
(see the overview and related work of this section). 

We have also provided a dynamic primary copy data replication algorithm. Like the dynamic
version of PAXOS, this algorithm is based on a variant of the DC service. This algorithm uses
a centralized approach in which a "leader" process is responsible for providing responses to clients'
queries. In order to keep consistency this leader process replicates (part of) its own state to a 
quorum of processes. The algorithm exploits the quorum-oriented framework provided by the DC 
service. We sketch the proof of correctness of this algorithm; the technique used to prove the correctness
of other applications developed in this thesis should also apply to this algorithm.

1.2 Distributed consensus 

1.2.1 Overview and related work 

Another important building block for distributed systems is distributed consensus. The consensus problem
arises in many forms and various contexts, such as distributed data replication,
distributed databases, and flight control systems. Data replication is used in practice to provide high
availability: having more than one copy of the data allows easier access to the data, i.e., the near- 
est copy of the data can be used. However, consistency among the copies must be maintained. A 
consensus algorithm can be used to maintain consistency. A practical example of the use of data 
replication is an airline reservation system. The data consists of the current booking information 
for the flights and it can be replicated at agencies spread over the world. The current booking 
information can be accessed at any of the replicas. Reservations or cancellations must be agreed 
upon by all the copies. 

In a distributed database, the consensus problem arises when a collection of processes participating
in the processing of a distributed transaction has to agree on whether to commit or abort the
transaction, that is, make the changes due to the transaction permanent or discard the changes. 
A common decision must be taken to avoid inconsistencies. A practical example of the use of dis- 
tributed transactions is a banking system. Transactions can be done at any bank location or ATM 
machine, and the commitment or abortion of each transaction must be agreed upon by all the bank 
locations or ATM machines involved. 

In a flight control system, the consensus problem arises when the flight surface and airplane control 
systems have to agree on whether to continue or abort a landing in progress or when the control 
systems of two approaching airplanes need to modify the air routes to avoid collision. 

Distributed consensus has been extensively studied; a good survey of early results is provided in 
[42]. We refer the reader to [65] for a more up-to-date treatment of consensus problems. 

One of the most celebrated results about distributed consensus is the impossibility result of 
Fischer, Lynch and Paterson [43]. This impossibility result, popularly known as FLP, states that it 
is impossible to achieve distributed consensus in asynchronous systems even if only a single stop failure 
is possible. This surprising result sparked various directions of research aimed at solving the problem 
by either restricting the asynchrony of the computation model (e.g. [31, 35]) or using randomized 
protocols (e.g. [18, 21, 80]) or weakening the problem definition (e.g. [24, 32, 39, 40]). 

The last of these three directions of research falls in the more general research area of demarcat- 
ing what is deterministically computable and what is deterministically impossible in asynchronous 
distributed systems in the presence of failures. The FLP impossibility seemed to suggest that no 
nontrivial problem could be solved deterministically and asynchronously in the presence of faults. 
Attiya, Bar-Noy, Dolev, Peleg and Reischuk [13] showed that the renaming problem can be solved 
deterministically in asynchronous systems in the presence of failures. Informally, in the renaming 
problem processors start the computation with a "name" taken from some unbounded ordered 
name space and have to "rename" themselves with names chosen from a new small name space. 
This result revived the research trend of exploring what is computable and what is impossible in 
deterministic asynchronous distributed systems subject to failures. Following this direction, Chaudhuri [24] defined 
the k-consensus problem, which is a natural generalization of the consensus problem obtained by 
allowing processes to decide on up to k different values, instead of requiring them to agree on a single 
value. The 1-consensus problem is the classical consensus problem. 

Chaudhuri provided an algorithm to solve the k-consensus problem that tolerates up to a threshold 
t of process failures strictly smaller than k. This result proved that the k-consensus problem, 
for k ≥ 2, allows more resilience than the 1-consensus problem. Chaudhuri conjectured that the 
k-consensus problem was impossible to solve while tolerating k or more failures. This conjecture was 
proven true by three independent research teams: Borowsky and Gafni [20], Herlihy and Shavit [52], 
and Saks and Zaharoglou [84]. Attiya [11] provided an alternative proof of the same result. 

The results of [24, 20, 52, 84] completely characterize the k-consensus problem in asynchronous 
systems with stop failures. In such a model the k-consensus problem is solvable if and only if t < k. 

The formal definition of the k-consensus problem requires three conditions to be satisfied: agreement, 
termination and validity. The agreement condition requires that each process decide on a 
value in such a way that the set of decided values has cardinality at most k. The termination condition 
simply requires that each (correct) process decide. As for the validity condition, 
several variants have been considered in the literature. The validity condition used in [24, 20, 52, 84] 
requires that each decision be equal to some input value. 
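
The three conditions can be made concrete with a small checker. The following Python sketch is only an illustration and not part of the formal development: it inspects the outcome of one finished run, and the names check_k_consensus, inputs, decisions and correct are assumptions introduced here. Termination is a liveness property, so the check is meaningful only for a run that has already ended.

```python
from typing import Dict, Hashable, Set


def check_k_consensus(
    inputs: Dict[str, Hashable],
    decisions: Dict[str, Hashable],
    correct: Set[str],
    k: int,
) -> Dict[str, bool]:
    """Check one finished run against the three k-consensus conditions.

    inputs    -- input value of every process
    decisions -- decided value of every process that decided
    correct   -- processes that did not fail (hypothetical bookkeeping)
    """
    decided_values = set(decisions.values())
    return {
        # Agreement: at most k distinct values are decided.
        "agreement": len(decided_values) <= k,
        # Termination: every correct process decides.
        "termination": all(p in decisions for p in correct),
        # Validity (variant of [24, 20, 52, 84]): each decision is some input.
        "validity": decided_values <= set(inputs.values()),
    }


inputs = {"p1": "a", "p2": "b", "p3": "c"}
decisions = {"p1": "a", "p2": "b", "p3": "a"}
result = check_k_consensus(inputs, decisions, correct={"p1", "p2", "p3"}, k=2)
```

With these values two distinct values are decided, so the run satisfies agreement for k = 2 but would violate it for k = 1.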

An alternative definition of the validity condition considered for the 1-consensus problem with 
stop failures requires that if all the inputs to the processes of the system are equal then any decision 
must be equal to the input (see, for example, Chapter 6 of [65]). 

In a Byzantine environment faulty processes can "mask" their inputs. Hence a more suitable 
validity condition considered for the 1-consensus problem with Byzantine failures requires that if all 
the correct processes have the same input then any decision be the input of a correct process [62, 78]. 

1.2.2 Work in this thesis 

In this thesis we have explored several alternative validity conditions for the k-consensus 
problem in asynchronous systems, both with stop failures and with Byzantine failures. We have 
considered six different definitions for the validity condition of the k-consensus problem. In many 
cases the validity condition makes a difference. Considering the six variations of the k-consensus 
problem in both the stop failure case and the Byzantine failure case leads to twelve 
different problems. One of these is the k-consensus problem considered in [24, 20, 52, 84]. Hence for 
this problem we already know the line that separates solvable from impossible (the problem is solvable 
if and only if the number of allowed failures is strictly less than k). For the other variations of the 
problem, and in particular for the Byzantine settings, the line between impossible and possible was 
not known. We have demarcated these lines. 

1.3 Summary of contributions 

This thesis provided new formal specifications for group communication services. The specifications 
are shown to be implementable and useful to build applications. The significance of this work is 
two-fold: on one hand it is a contribution in the identification of useful formal specifications for 
group communication services, a research area very active recently; on the other hand we have 
explored the possibility of integrating into a single group communication building block the notion 
of primary view and that of configuration, both of which are well known but have never been used 
together. The specifications we have provided are tailored to dynamic systems, where processors 
join and leave the system routinely and possibly permanently. The approach used to design such 
dynamic services has also been applied to transform known algorithms, designed for static settings, 
in order to make them adaptable to dynamic systems. 

This thesis also investigated some theoretical aspects of another important building block for 
distributed systems: distributed consensus. We extended previous work by exploring several variations 
of the problem definition and model, including for the first time investigation of Byzantine failures. 
We showed that the precise definition of the validity requirement, which characterizes what decision 
values are allowed as a function of the input values and whether failures occur, is crucial to the 
solvability of the problem. We introduced six validity conditions for this problem (all considered in 
various contexts in the literature), and we demarcated the line between possible and impossible for 
each case. In many cases this line is different from the one of the originally studied k-set consensus 
problem. 

1.4 Thesis roadmap 

The rest of the thesis is divided into two parts. The first part is dedicated to group communication 
services while the second part studies the consensus problem. 

Part I (group communication services) is structured as follows. Chapter 2 contains an overview 
of group communication services. Chapter 3 contains notation and terminology used throughout the 
rest of the part and introduces the I/O automaton model, which is used to provide the specifications, 
the implementations and the applications. Chapter 4 describes the VS service of [41]; this service is 
used as a building block for the implementations of the group communication services provided in this 
thesis. Chapter 5 contains the DVS specification, a specification for a dynamic primary view group 
communication service, together with an implementation and a totally ordered broadcast service 
running on top of DVS. Chapter 6 contains the DC specification, a specification for a dynamic 
primary configuration group communication service, together with an atomic read/write register 
implemented on top of DC. Chapter 7 provides a version of Lamport's PAXOS algorithm implemented 
on top of a variation of the DC service. Finally, Chapter 8 provides concluding remarks for Part I. 

Part II (distributed consensus) is structured as follows. Chapter 9 contains an introduction to 
the problem. Chapter 10 describes the model of computation and provides a formal definition of the 
problem. Chapters 11 and 12 study the k-set consensus problem in the crash failure and Byzantine 
failure models, respectively. Finally, Chapter 13 provides concluding remarks for Part II. 

Part I 



Group Communication Services 

Chapter 2 

Group Communication Services: 
Overview 



Developing distributed applications is a difficult task, because of the complexities of the applications 
themselves and of the fault-prone distributed settings in which they run. Considerable effort is de- 
voted to making distributed applications robust in the face of typical processor and communication 
failures. A successful approach to overcome these difficulties is to modularize the system by im- 
plementing suitable building blocks that provide powerful general-purpose distributed computation 
services. 

Among the most important examples of building blocks are group communication services. Group 
communication services enable processes located at different nodes of a distributed network to op- 
erate collectively as a group. The processes do this by using a group communication service to 
multicast messages to all members of the group. Different group communication services offer differ- 
ent guarantees about the order and reliability of message delivery. Examples are found in Isis [19], 
Transis [33], Totem [73], Newtop [38], Relacs [14] and Horus [86]. 

The basis of a group communication service is a group membership service. Each process, at each 
time, has a unique view of the membership of the group. The view includes a list of the processes 
that are members of the group. Views can change from time to time, and may become different at 
different processes. Isis introduced the important concept of virtual synchrony [19]. This concept 
has been interpreted in various ways, but an essential requirement is that if a particular message 
is delivered to several processes, then all have the same view of the membership when the message 
is delivered, which is also the view in which the message was sent. This allows the recipients to take 
coordinated action based on the message, the membership set and, obviously, the application. 

To be most useful to application programmers, system building blocks should come equipped 
with simple and precise specifications of their guaranteed behavior. Such specifications would allow 
application programmers to think carefully about the behavior of systems that use the primitives, 
without having to understand how the primitives themselves are implemented. Unfortunately, pro- 
viding appropriate specifications for group communication services is not an easy task. Some of 
these services are rather complicated, and there is still no agreement about exactly what the guar- 
antees should be. Different specifications arise from different implementations of the same service, 
because of differences in the safety, performance, or fault-tolerance that is provided. Moreover, the 
specifications that most accurately describe particular implementations may not be the ones that 
are easiest for application programmers to use. Examples of specifications for group membership and 
communication services can be found in [17, 23, 25, 27, 34, 72, 81, 83]. 

In distributed applications involving replicated data, a well-known way to enhance the availability 
and efficiency of the system is to use quorums. A quorum system is a set of subsets of the members of 
the system such that any two of the subsets intersect. We refer to a view with a quorum 
system as a configuration. Using configurations, an update can be performed with only a quorum 
available, while with an ordinary view all of the members must be available. The intersection 
property of quorums guarantees consistency within a given configuration. Quorum systems have 
been extensively studied and used in applications (e.g., [2, 37, 47, 50, 70, 74]). 
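
As an illustration of the intersection property, the following Python sketch checks whether a finite family of subsets is a quorum system over a given membership set, and builds the familiar majority quorum system. The function names is_quorum_system and majorities are assumptions made for this example, and member identifiers are assumed to be sortable.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set


def is_quorum_system(members: Set[int], quorums: Iterable[Set[int]]) -> bool:
    """True if the quorums are nonempty subsets of members that pairwise intersect."""
    qs = [frozenset(q) for q in quorums]
    if not qs:
        return False
    if any(not q or not q <= members for q in qs):
        return False
    # Intersection property: any two quorums share at least one member.
    return all(q1 & q2 for q1, q2 in combinations(qs, 2))


def majorities(members: Set[int]) -> List[FrozenSet[int]]:
    """All subsets containing a strict majority of the members."""
    threshold = len(members) // 2 + 1
    return [
        frozenset(c)
        for size in range(threshold, len(members) + 1)
        for c in combinations(sorted(members), size)
    ]


members = {1, 2, 3}
majority_quorums = majorities(members)
```

Any two majorities of the same membership set intersect, which is why an update acknowledged by one majority is visible to every later majority.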

Pre-defined quorum sets can yield efficient implementations in settings where the system is 
relatively static, that is, failures are transient. However, they work less well in settings where the 
set of processors in the network evolves over time, with processes joining and leaving the system. 
For such a setting, a dynamic notion of primary is needed. A dynamic notion of primary still 
needs to maintain some kind of intersection property, in order to permit enough information flow 
between successive primary views to achieve coherence. For example, each primary view might have 
to contain at least a majority of the processes in the previous primary view. Several dynamic voting 
schemes have been developed to define primaries adaptively, e.g. [28, 36, 55, 88, 71, 77]. 

In particular, Yeger Lotem, Keidar, and Dolev [88] have described an implementation of a group 
membership service that yields only primary views, according to a dynamic notion of primary. An 
interesting feature of their work is that it points out various subtleties of implementing such a 
membership service in a distributed manner - subtleties involving different opinions by different 
processes about what is the previous primary view. These difficulties have led to errors in some of 
the past work on dynamic voting. The algorithm of [88] copes with these subtleties by maintaining 
information about a collection of primary views that "might be" the previous primary view. The 
service deals with group membership only, and not with communication. Yeger Lotem et al. prove 
that their protocol satisfies the following condition on system executions: any two (primary) views 
that occur in an execution are linked by a chain of views where for every consecutive pair of views 
in the chain, there is some process that "knows" it belongs to both views. 

In Chapter 5 we provide a group communication service, called DVS, that integrates the VS 
group communication service with a dynamic primary view membership service, yielding a dynamic 
primary view group communication service. The DVS service is inspired by the implementation 
of [88], but integrates communication with the group membership service. We also show how the 
DVS specification can be implemented and used for an application. 

In Chapter 6 we extend the notion of "primary view" to that of "primary configuration". The 
main difficulty in making this step is to identify the intersection property between two successive 
primary configurations that allows consistency to be maintained. We propose one such property: 
we require that there be a quorum of the old primary configuration that is included in the 
membership set of the new primary configuration. This guarantees that there is at least one process 
in the new primary configuration that has the most up-to-date information. Like the 
intersection property of dynamic primary views, this allows information to flow from the old configuration 
to the new one and thus preserves consistency. 

We actually consider a more specialized version of configurations which uses two sets of quorums, 
a set of read quorums and a set of write quorums, with the property that any read quorum intersects 
any write quorum. (This choice is justified by the application we develop, an atomic read/write 
register.) With this kind of configuration the intersection property that we require for a new primary 
configuration is that there be one read quorum and one write quorum both of which are included 
in the membership set of the new primary configuration. The use of read and write quorums (as 
opposed to just quorums) can be more efficient in order to balance the load of the system (e.g., [37]). 
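
The two intersection properties described above, for plain quorums and for read and write quorums, can be phrased as simple set checks. The Python sketch below is a minimal illustration under the assumption that quorums and membership sets are finite sets of processor identifiers; the names includes_quorum, may_follow and may_follow_rw are hypothetical and do not come from the DC specification.

```python
from typing import Iterable, Set


def includes_quorum(quorums: Iterable[Set[int]], members: Set[int]) -> bool:
    """True if some quorum is entirely contained in the membership set."""
    target = frozenset(members)
    return any(frozenset(q) <= target for q in quorums)


def may_follow(old_quorums: Iterable[Set[int]], new_members: Set[int]) -> bool:
    """Plain quorums: a quorum of the old configuration is in the new membership."""
    return includes_quorum(old_quorums, new_members)


def may_follow_rw(
    old_read_quorums: Iterable[Set[int]],
    old_write_quorums: Iterable[Set[int]],
    new_members: Set[int],
) -> bool:
    """Read/write quorums: one read quorum and one write quorum are both included."""
    return includes_quorum(old_read_quorums, new_members) and includes_quorum(
        old_write_quorums, new_members
    )


old_quorums = [{1, 2}, {2, 3}, {1, 3}]
old_reads, old_writes = [{1}, {2}], [{1, 2}]
```

For example, with the quorums above a new membership set {2, 3, 4} includes the old quorum {2, 3}, while {3, 4, 5} includes none.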

We provide a group communication service, called DC, that integrates a group communication 
service with a dynamic primary configuration membership service. We prove that the DC service is 
implementable and can be used for applications. 

Finally, in Chapter 7, we investigate the use of the technique introduced to design DVS and DC 
to transform services and applications that are designed for "static" settings into ones that work 
well in "dynamic" settings. Specifically, we design a service similar to DC and we use that service to 
provide a dynamic version of Lamport's PAXOS algorithm [61]. The original algorithm is designed 
for systems that are relatively static: if a majority (or, more generally, a quorum) of the processors is 
not available then the algorithm blocks. The dynamic version adapts well to permanent changes of 
the system. We also design a primary copy data replication algorithm; this algorithm is similar to 
the Liskov-Oki algorithm [76] but considers dynamic settings, while the Liskov-Oki algorithm is designed for 
static settings. 

Chapter 3 



Mathematical foundations 



In this chapter we introduce some terminology and notation, and then we provide the underlying for- 
mal model used to specify our group communication services and applications. Section 3.1 provides 
terminology and notation and Section 3.2 describes the IOA model. 

3.1 Notation and terminology 

3.1.1 Sets, functions, sequences 

We write $\lambda$ for the empty sequence. If $a$ is a sequence then $|a|$ denotes the length of $a$. If $a$ is 
a sequence and $1 \leq i \leq j \leq |a|$ then $a(i)$ denotes the $i$th element of $a$ and $a(i..j)$ denotes the 
subsequence $a(i), a(i+1), \ldots, a(j)$ of $a$. The head of a nonempty sequence $a$ is $a(1)$. A sequence can 
be used as a queue: the append operation modifies the sequence by concatenating it with a new 
element and the remove operation modifies the sequence by deleting its head. 

If $a$ and $b$ are sequences, with $a$ finite, then $a \circ b$ denotes the concatenation of $a$ and $b$. We sometimes 
abuse this notation by letting $a$ or $b$ be a single element. We say that sequence $a$ is a prefix of 
sequence $b$, written $a \leq b$, provided that there exists $c$ such that $a \circ c = b$. A collection $A$ of 
sequences is consistent provided that $a \leq b$ or $b \leq a$ for all $a, b \in A$. If $A$ is a consistent collection 
of sequences, we define $\mathit{lub}(A)$ to be the minimum sequence $b$ such that $a \leq b$ for all $a \in A$. 
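
The prefix order and the least upper bound can be illustrated on finite sequences. The Python sketch below represents sequences as tuples and handles only finite collections of finite sequences, which is a restriction assumed for this example; in that case the least upper bound of a consistent collection is simply its longest member.

```python
from typing import Iterable, Tuple

Seq = Tuple[object, ...]


def is_prefix(a: Seq, b: Seq) -> bool:
    """a <= b: there is some c with a concatenated with c equal to b."""
    return len(a) <= len(b) and b[: len(a)] == a


def is_consistent(collection: Iterable[Seq]) -> bool:
    """Every two sequences in the collection are comparable in the prefix order."""
    seqs = list(collection)
    return all(is_prefix(a, b) or is_prefix(b, a) for a in seqs for b in seqs)


def lub(collection: Iterable[Seq]) -> Seq:
    """Least upper bound of a finite consistent collection of finite sequences."""
    seqs = list(collection)
    if not is_consistent(seqs):
        raise ValueError("lub is defined only for consistent collections")
    # The empty collection is bounded by the empty sequence.
    return max(seqs, key=len, default=())


history = [(1,), (1, 2), (1, 2, 3)]
latest = lub(history)
```

Here the three sequences form a chain in the prefix order, so the collection is consistent and its least upper bound is the longest sequence.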

If $S$ is a set, then $\mathit{seqof}(S)$ denotes the set of all finite sequences of elements of $S$. If $a \in \mathit{seqof}(S)$ 
and $f$ is a partial function from $S$ to $T$ whose domain includes the set of all elements of $S$ appearing in 
$a$, then $\mathit{applytoall}(f, a)$ denotes the sequence $b$ such that $\mathit{length}(b) = \mathit{length}(a)$ and, for $i \leq \mathit{length}(b)$, 
$b(i) = f(a(i))$. 

If $S$ is a set, the notation $S_\perp$ refers to the set $S \cup \{\perp\}$. Whenever $S$ is ordered, we order $S_\perp$ 
by extending the order on $S$, and making $\perp$ less than all elements of $S$. If $R$ is a binary relation, 
then we define $\mathit{dom}(R)$, the domain of $R$, to be the set (without repetitions) of first elements of the 
ordered pairs comprising relation $R$. If $f$ is a partial function from $S$ to $T$, and $\langle s, t \rangle \in S \times T$, then 
$f \oplus \langle s, t \rangle$ is defined to be the partial function that is identical to $f$ except that $f(s) = t$. 

We denote by $\mathit{arrayof}(S)$ the set of all arrays, indexed by positive integers, whose entries are 
elements of $S_\perp$. 

3.1.2 Processors, views, configurations, identifiers 

$\mathcal{P}$ denotes the universe of all processors,¹ and $\mathcal{M}$ the universe of all possible messages. $\mathcal{G}$ is a 
totally ordered set of identifiers used to distinguish views or configurations, with a distinguished 
least element $g_0$. 

A view $v = \langle g, P \rangle$ consists of a view identifier $g \in \mathcal{G}$ and a nonempty membership set $P \subseteq \mathcal{P}$; we 
write $v.\mathit{id}$ and $v.\mathit{set}$ to denote the view identifier and membership set components of $v$, respectively. 

$\mathcal{V}$ denotes the set of all views, and $v_0 = \langle g_0, P_0 \rangle$ is a distinguished initial view. 

The notion of view can be generalized to that of configuration. A configuration is a view with 
a structure defined on the view. For example a configuration can be a view with a set of quorums 
defined over the membership set of the view. However, configurations can be tailored to applications. 
For example, applications that use read and write quorums, use configurations which are views with 
a set of read quorums and a set of write quorums; applications that use a "leader" processor use 
configurations with a leader processor. Next we define several types of configurations. We will 
specify the type of configuration we use in the chapter where we use it. 

A configuration is a triple, $c = \langle g, P, \mathcal{Q} \rangle$, where $g \in \mathcal{G}$ is a unique identifier, $P \subseteq \mathcal{P}$ is a nonempty 
set of processors, and $\mathcal{Q}$ is a nonempty set of nonempty subsets of $P$, such that any two subsets 
intersect. Each element of $\mathcal{Q}$ is called a quorum of $c$. 

A more specialized type of configuration is a quadruple, $c = \langle g, P, \mathcal{R}, \mathcal{W} \rangle$, where $g \in \mathcal{G}$ is a unique 
identifier, $P \subseteq \mathcal{P}$ is a nonempty set of processors, and $\mathcal{R}$ and $\mathcal{W}$ are nonempty sets of nonempty 
subsets of $P$, such that $R \cap W \neq \emptyset$ for all $R \in \mathcal{R}$, $W \in \mathcal{W}$. Each element of $\mathcal{R}$ is called a read quorum 
of $c$, and each element of $\mathcal{W}$ a write quorum. We refer to this type of configuration as a read-write 
quorum configuration. 
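
A read-write quorum configuration can be sketched as a small immutable record. The Python example below is only an illustration: the class name RWConfiguration, the use of integers for identifiers and strings for processors, and the field names ident and members (standing for the identifier and the membership set) are assumptions made here.

```python
from dataclasses import dataclass
from typing import FrozenSet

Quorum = FrozenSet[str]


@dataclass(frozen=True)
class RWConfiguration:
    ident: int                  # configuration identifier g
    members: FrozenSet[str]     # membership set P
    rqrms: FrozenSet[Quorum]    # read quorums
    wqrms: FrozenSet[Quorum]    # write quorums

    def is_well_formed(self) -> bool:
        """Check the conditions stated in the definition above."""
        if not self.members or not self.rqrms or not self.wqrms:
            return False
        # Every quorum is a nonempty subset of the membership set.
        if any(not q or not q <= self.members for q in self.rqrms | self.wqrms):
            return False
        # Every read quorum intersects every write quorum.
        return all(r & w for r in self.rqrms for w in self.wqrms)


c = RWConfiguration(
    ident=1,
    members=frozenset({"p1", "p2", "p3"}),
    rqrms=frozenset({frozenset({"p1"}), frozenset({"p2", "p3"})}),
    wqrms=frozenset({frozenset({"p1", "p2"}), frozenset({"p1", "p3"})}),
)
```

Note that two read quorums need not intersect each other: in this example {p1} and {p2, p3} are disjoint, yet each meets every write quorum.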

Another type of configuration is a quadruple, $c = \langle g, P, \mathcal{Q}, p \rangle$, where $g \in \mathcal{G}$ is a unique identifier, 
$P \subseteq \mathcal{P}$ is a nonempty set of processors, $\mathcal{Q}$ is a quorum system, and $p \in P$ is a distinguished 
processor, called the leader of the configuration. We refer to this type of configuration as a leader 
configuration. 

Once a particular type of configuration is fixed, we let $\mathcal{C}$ denote the set of all configurations. Given 
a configuration $c$, the notation $c.\mathit{id}$ refers to the configuration identifier $g$, the notation $c.\mathit{set}$ refers 
to the membership set $P$; the notation $c.\mathit{qrms}$ refers to the quorum system $\mathcal{Q}$ while $c.\mathit{rqrms}$ and 
$c.\mathit{wqrms}$ refer to the set of read quorums $\mathcal{R}$ and the set of write quorums $\mathcal{W}$, respectively; the notation 
$c.\mathit{ldr}$ refers to the leader $p$ of configuration $c$. 

1 We use "processor" and "process" interchangeably. 

We distinguish an initial configuration $c_0 = \langle g_0, P_0, \mathcal{R}_0, \mathcal{W}_0 \rangle$ (or $c_0 = \langle g_0, P_0, \mathcal{Q}_0 \rangle$, or 
$c_0 = \langle g_0, P_0, \mathcal{Q}_0, p_0 \rangle$, depending on the type of configuration that we are using), where $g_0$ is a distinguished 
configuration identifier. 

3.2 The I/O automaton (IOA) model 

We describe our services and algorithms using the I/O automaton model of Lynch and Tuttle [67] 
(without fairness). 

The I/O automata (IOA for short) model is a formal model suitable for describing asynchronous 
distributed systems. The basic I/O automaton model was introduced by Lynch and Tuttle [67]. 

Various extensions of the basic model have been developed. For example two extensions provide 
formal mechanisms to handle the passage of time and thus are suitable for describing partially 
synchronous distributed systems; these models are the MMT automaton (MMTA for short) and the 
general timed automaton (GT automaton or GTA for short). The MMTA is a special case of GTA, 
and thus it can be regarded as a notation for describing some GT automata. 

For the purpose of this thesis, we will use the basic I/O automaton model, which we describe 
in Section 3.2.1. Section 3.2.2 describes the "composition" operation on automata. The interested 
reader can find more information about IOA in [65]. 

3.2.1 IOA definition 

An I/O automaton is a simple type of state machine in which transitions are associated with named 
actions. These actions are classified into categories, namely input, output, internal and, for the timed 
models, time-passage. Input and output actions are used for communication with the external envi- 
ronment, while internal actions are local to the automaton. The time-passage actions are intended 
to model the passage of time. The input actions are assumed not to be under the control of the 
automaton, that is, they are controlled by the external environment which can force the automaton 
to execute the input actions. Internal and output actions are controlled by the automaton. The 
time-passage actions are also controlled by the automaton (though this may at first seem somewhat 
strange, it is just a formal way of modeling the fact that the automaton must perform some action 
before some amount of time elapses). 

As an example, we can consider an I/O automaton that models the behavior of a process involved 
in a consensus problem. Figure 3-1 shows the typical interface (that is, input and output actions) 
of such an automaton. The automaton is drawn as a circle, input actions are depicted as incoming 
arrows and output actions as outgoing arrows (internal actions are hidden since they are local 
to the automaton). The automaton receives inputs from the external world by means of action 
INIT(v), which represents the receipt of an input value v, and conveys outputs by means of action 
DECIDE(v), which represents a decision of v. Actions SEND(m) and RECEIVE(m) are supposed to model 
the communication with other automata. 

[Figure 3-1: An I/O automaton, drawn as a circle with arrows labeled Init(v), Decide(v), Send(m) and Receive(m).] 

A signature $S$ is a triple consisting of three disjoint sets of actions: the input actions, $\mathit{in}(S)$, 
the output actions, $\mathit{out}(S)$, and the internal actions, $\mathit{int}(S)$. The external actions, $\mathit{ext}(S)$, are 
$\mathit{in}(S) \cup \mathit{out}(S)$; the locally controlled actions, $\mathit{local}(S)$, are $\mathit{out}(S) \cup \mathit{int}(S)$; and $\mathit{acts}(S)$ consists of all 
the actions of $S$. The external signature, $\mathit{extsig}(S)$, is defined to be the signature $(\mathit{in}(S), \mathit{out}(S), \emptyset)$. 
The external signature is also referred to as the external interface. 

An I/O automaton (IOA for short) $A$ consists of five components: 

• $\mathit{sig}(A)$, a signature 

• $\mathit{states}(A)$, a (not necessarily finite) set of states 

• $\mathit{start}(A)$, a nonempty subset of $\mathit{states}(A)$ known as the start states or initial states 

• $\mathit{trans}(A)$, a state-transition relation, where 
$\mathit{trans}(A) \subseteq \mathit{states}(A) \times \mathit{acts}(\mathit{sig}(A)) \times \mathit{states}(A)$; this must have the property that for every state $s$ and every 
input action $\pi$, there is a transition $(s, \pi, s') \in \mathit{trans}(A)$ 

• $\mathit{tasks}(A)$, a task partition, which is an equivalence relation on $\mathit{local}(\mathit{sig}(A))$ having at most 
countably many equivalence classes 

Often acts(A) is used as shorthand for acts(sig(A)), and similarly in(A), and so on. 

An element $(s, \pi, s')$ of $\mathit{trans}(A)$ is called a transition, or step, of $A$. If for a particular state $s$ 
and action $\pi$, $A$ has some transition of the form $(s, \pi, s')$, then we say that $\pi$ is enabled in $s$. Input 
actions are enabled in every state. 
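
For a finite automaton the definition can be written down directly. The Python sketch below is a minimal illustration under stated assumptions: the state set is finite, the task partition is omitted (this thesis uses the model without fairness), and the class names Signature and Automaton as well as the toy automaton are introduced only for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Transition = Tuple[str, str, str]  # (state, action, next state)


@dataclass(frozen=True)
class Signature:
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    internals: FrozenSet[str]

    def acts(self) -> FrozenSet[str]:
        return self.inputs | self.outputs | self.internals

    def local(self) -> FrozenSet[str]:
        # Locally controlled actions: outputs and internal actions.
        return self.outputs | self.internals


@dataclass(frozen=True)
class Automaton:
    sig: Signature
    states: FrozenSet[str]
    start: FrozenSet[str]
    trans: FrozenSet[Transition]

    def enabled(self, state: str, action: str) -> bool:
        """An action is enabled in a state if some transition starts with that pair."""
        return any(s == state and a == action for (s, a, _) in self.trans)

    def is_input_enabled(self) -> bool:
        """Every input action must be enabled in every state."""
        return all(self.enabled(s, a) for s in self.states for a in self.sig.inputs)


sig = Signature(frozenset({"init"}), frozenset({"decide"}), frozenset())
toy = Automaton(
    sig=sig,
    states=frozenset({"idle", "ready", "done"}),
    start=frozenset({"idle"}),
    trans=frozenset({
        ("idle", "init", "ready"),
        ("ready", "init", "ready"),
        ("done", "init", "done"),
        ("ready", "decide", "done"),
    }),
)
```

The toy automaton accepts init in every state, as input-enabledness requires, while decide is enabled only in the state ready.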

The fifth component of the I/O automaton definition, the task partition tasks(A), should be 
thought of as an abstract description of "tasks," or "threads of control," within the automaton. This 
partition is used to define fairness conditions on an execution of the automaton; roughly speaking, 
the fairness conditions say that the automaton must continue, during its execution, to give fair turns 
to each of its tasks. 

An execution fragment of $A$ is either a finite sequence, $s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r$, or an infinite 
sequence, $s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r, \ldots$, of alternating states and actions of $A$ such that $(s_k, \pi_{k+1}, s_{k+1})$ 
is a transition of $A$ for every $k \geq 0$. Note that if the sequence is finite, it must end with a state. 
An execution fragment beginning with a start state is called an execution. The length of a finite 
execution fragment $\alpha = s_0, \pi_1, s_1, \pi_2, \ldots, \pi_r, s_r$ is $r$. The set of executions of $A$ is denoted by 
$\mathit{execs}(A)$. A state is said to be reachable in $A$ if it is the final state of a finite execution of $A$. 

The trace of an execution $\alpha$ of $A$, denoted by $\mathit{trace}(\alpha)$, is the subsequence of $\alpha$ consisting of all 
the external actions. A sequence $\beta$ is a trace of $A$ if it is the trace of an execution of $A$. The set of traces of $A$ is 
denoted by $\mathit{traces}(A)$. 
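
Finite executions and their traces can be illustrated with plain lists. In the Python sketch below a finite execution fragment is a list that alternates states and actions, starting and ending with a state; the function names and the small transition relation are assumptions made for this example.

```python
from typing import List, Set, Tuple

Transition = Tuple[str, str, str]  # (state, action, next state)


def is_execution_fragment(trans: Set[Transition], seq: List[str]) -> bool:
    """seq = [s0, pi1, s1, ..., pir, sr]; every consecutive triple is a transition."""
    if len(seq) % 2 == 0:  # must be nonempty and end with a state
        return False
    return all(
        (seq[i], seq[i + 1], seq[i + 2]) in trans for i in range(0, len(seq) - 2, 2)
    )


def is_execution(trans: Set[Transition], start: Set[str], seq: List[str]) -> bool:
    """An execution is an execution fragment that begins with a start state."""
    return is_execution_fragment(trans, seq) and seq[0] in start


def trace(seq: List[str], external: Set[str]) -> List[str]:
    """The subsequence of external actions; actions sit at the odd positions."""
    return [seq[i] for i in range(1, len(seq), 2) if seq[i] in external]


trans = {
    ("idle", "init", "ready"),
    ("ready", "think", "ready"),   # "think" plays the role of an internal action
    ("ready", "decide", "done"),
}
alpha = ["idle", "init", "ready", "think", "ready", "decide", "done"]
beta = trace(alpha, external={"init", "decide"})
length = (len(alpha) - 1) // 2
```

The internal action does not appear in the trace, so two executions that differ only in internal steps have the same externally visible behavior.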

3.2.2 Composition of IOA 

The composition operation allows an automaton representing a complex system to be constructed by 
composing automata representing simpler system components. The most important characteristic of 
the composition of automata is that properties of isolated system components still hold when those 
isolated components are composed with other components. The composition identifies actions with 
the same name in different component automata. When any component automaton performs a step 
involving $\pi$, so do all component automata that have $\pi$ in their signatures. Since internal actions of 
an automaton A are intended to be unobservable by any other automaton B, automaton A cannot 
be composed with automaton B unless the internal actions of A are disjoint from the actions of B. 
(Otherwise, A's performance of an internal action could force B to take a step.) Moreover, A and 
B cannot be composed unless the sets of output actions of A and B are disjoint. (Otherwise two 
automata would have the control of an output action.) 

Let $I$ be an arbitrary finite index set.² A finite collection $\{S_i\}_{i \in I}$ of signatures is said 
to be compatible if for all $i, j \in I$, $i \neq j$, the following hold:³ 

1. $\mathit{int}(S_i) \cap \mathit{acts}(S_j) = \emptyset$ 

2. $\mathit{out}(S_i) \cap \mathit{out}(S_j) = \emptyset$ 

A finite collection of automata is said to be compatible if their signatures are compatible. 
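
The two compatibility conditions translate into a short check over ordered pairs of distinct signatures. In the Python sketch below a signature is represented, as an assumption of this example, by a dictionary with keys "in", "out" and "int" holding sets of action names.

```python
from itertools import permutations
from typing import Dict, List, Set

Sig = Dict[str, Set[str]]


def acts(sig: Sig) -> Set[str]:
    """All actions of a signature."""
    return sig["in"] | sig["out"] | sig["int"]


def compatible(sigs: List[Sig]) -> bool:
    """Check conditions 1 and 2 for every ordered pair of distinct signatures."""
    return all(
        not (si["int"] & acts(sj))      # 1. internal actions are private
        and not (si["out"] & sj["out"])  # 2. each output has a single owner
        for si, sj in permutations(sigs, 2)
    )


sender = {"in": {"init"}, "out": {"send"}, "int": {"tick"}}
receiver = {"in": {"send"}, "out": {"decide"}, "int": set()}
ok = compatible([sender, receiver])
```

Sharing the action send is allowed here because it is an output of one signature and an input of the other; only shared outputs and exposed internal actions are ruled out.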



2 The composition operation for IOA is defined also for an infinite but countable collection of automata [65], but 
we only consider the composition of a finite number of automata. 

3 We remark that for the composition of an infinite countable collection of automata, there is a third condition on 
the definition of compatible signature [65]. However this third condition is automatically satisfied when considering 
only finite sets of automata. 

When we compose a collection of automata, output actions of the components become output 
actions of the composition, internal actions of the components become internal actions of the composition, 
and actions that are inputs to some components but outputs of none become input actions 
of the composition. Formally, the composition $S = \prod_{i \in I} S_i$ of a finite compatible collection of 
signatures $\{S_i\}_{i \in I}$ is defined to be the signature with 

• $\mathit{out}(S) = \bigcup_{i \in I} \mathit{out}(S_i)$ 

• $\mathit{int}(S) = \bigcup_{i \in I} \mathit{int}(S_i)$ 

• $\mathit{in}(S) = \bigcup_{i \in I} \mathit{in}(S_i) - \bigcup_{i \in I} \mathit{out}(S_i)$ 
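
The composition of signatures amounts to three set computations. The Python sketch below assumes, for illustration only, that a signature is a dictionary with keys "in", "out" and "int" holding sets of action names, and that the collection passed in is finite and compatible.

```python
from typing import Dict, List, Set

Sig = Dict[str, Set[str]]


def compose_signatures(sigs: List[Sig]) -> Sig:
    """Composition of a finite compatible collection of signatures."""
    out = set().union(*(s["out"] for s in sigs))
    internal = set().union(*(s["int"] for s in sigs))
    # Inputs of the composition: inputs of some component that no component outputs.
    inp = set().union(*(s["in"] for s in sigs)) - out
    return {"in": inp, "out": out, "int": internal}


sender = {"in": {"init"}, "out": {"send"}, "int": {"tick"}}
receiver = {"in": {"send"}, "out": {"decide"}, "int": set()}
composed = compose_signatures([sender, receiver])
```

In this example send is an input of the receiver but an output of the sender, so in the composition it is an output and no longer an input.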

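The composition $A = \prod_{i \in I} A_i$ of a finite compatible collection of automata is defined as follows:⁴ 

• $\mathit{sig}(A) = \prod_{i \in I} \mathit{sig}(A_i)$ 

• $\mathit{states}(A) = \prod_{i \in I} \mathit{states}(A_i)$ 

• $\mathit{start}(A) = \prod_{i \in I} \mathit{start}(A_i)$ 

• $\mathit{trans}(A)$ is the set of triples $(s, \pi, s')$ such that, for all $i \in I$, if $\pi \in \mathit{acts}(A_i)$, then 
$(s_i, \pi, s'_i) \in \mathit{trans}(A_i)$; otherwise $s_i = s'_i$ 

• $\mathit{tasks}(A) = \bigcup_{i \in I} \mathit{tasks}(A_i)$ 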

Thus, the states and start states of the composition automaton are vectors of states and start 
states, respectively, of the component automata. The transitions of the composition are obtained by 
allowing all the component automata that have a particular action $\pi$ in their signature to participate 
simultaneously in steps involving $\pi$, while all the other component automata do nothing. The 
task partition of the composition's locally controlled actions is formed by taking the union of the 
components' task partitions; that is, each equivalence class of each component automaton becomes an 
equivalence class of the composition. This means that the task structure of individual components is 
preserved when the components are composed. Notice that since the automata Ai are input-enabled, 
so is their composition. The following theorem follows from the definition of composition. 

Theorem 3.2.1 The composition of a compatible collection of I/O automata is an I/O automaton. 
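
As an illustration, the compatibility conditions and the composition of signatures can be sketched in a few lines of Python. This is only a sketch under simplifying assumptions: actions are represented as strings, and the names `Signature`, `compatible` and `compose` are hypothetical, introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A signature: disjoint sets of input, output and internal actions."""
    inputs: frozenset
    outputs: frozenset
    internals: frozenset

    @property
    def acts(self):
        return self.inputs | self.outputs | self.internals


def compatible(sigs):
    """Check the two compatibility conditions for every ordered pair i != j."""
    for i, si in enumerate(sigs):
        for j, sj in enumerate(sigs):
            if i == j:
                continue
            if si.internals & sj.acts:    # condition 1: int(S_i) and acts(S_j) disjoint
                return False
            if si.outputs & sj.outputs:   # condition 2: out(S_i) and out(S_j) disjoint
                return False
    return True


def compose(sigs):
    """Compose a finite compatible collection of signatures."""
    if not compatible(sigs):
        raise ValueError("signatures are not compatible")
    outs = frozenset().union(*(s.outputs for s in sigs))
    ints = frozenset().union(*(s.internals for s in sigs))
    # Inputs of the composition: inputs of some component, outputs of none.
    ins = frozenset().union(*(s.inputs for s in sigs)) - outs
    return Signature(ins, outs, ints)
```

For instance, composing a sender whose only action is the output `send` with a channel that has input `send`, output `recv` and internal action `tick` yields a signature with no inputs, outputs `send` and `recv`, and internal action `tick`.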

The following theorems relate the executions and traces of a composition to those of the
component automata. The first says that an execution or trace of a composition "projects" to yield
executions or traces of the component automata. Given an execution $\alpha = s_0, \pi_1, s_1, \ldots$ of $A$, let



4 The $\prod$ notation in the definition of $start(A)$ and $states(A)$ refers to the ordinary Cartesian product, while the $\prod$
notation in the definition of $sig(A)$ refers to the composition operation just defined, for signatures. Also, the notation
$s_i$ denotes the $i$th component of the state vector $s$.
$\alpha|A_i$ be the sequence obtained by deleting each pair $\pi_r, s_r$ for which $\pi_r$ is not an action of $A_i$ and
replacing each remaining $s_r$ by $(s_r)_i$, that is, automaton $A_i$'s piece of the state $s_r$. Also, given a
trace $\beta$ of $A$ (or, more generally, any sequence of actions), let $\beta|A_i$ be the subsequence of $\beta$ consisting
of all the actions of $A_i$ in $\beta$. Also, for a given set of actions, $|$ denotes the analogous restriction: the
subsequence of a sequence $\beta$ of actions consisting of all the actions of that set in $\beta$.

Theorem 3.2.2 Let $\{A_i\}_{i \in I}$ be a compatible collection of automata and let $A = \prod_{i \in I} A_i$.

1. If $\alpha \in execs(A)$, then $\alpha|A_i \in execs(A_i)$ for every $i \in I$.

2. If $\beta \in traces(A)$, then $\beta|A_i \in traces(A_i)$ for every $i \in I$.

The other two are converses of Theorem 3.2.2. The next theorem says that, under certain 
conditions, executions of component automata can be "pasted together" to form an execution of the 
composition. 

Theorem 3.2.3 Let $\{A_i\}_{i \in I}$ be a compatible collection of automata and let $A = \prod_{i \in I} A_i$. Suppose
$\alpha_i$ is an execution of $A_i$ for every $i \in I$, and suppose $\beta$ is a sequence of actions in $ext(A)$ such that
$\beta|A_i = trace(\alpha_i)$ for every $i \in I$. Then there is an execution $\alpha$ of $A$ such that $\beta = trace(\alpha)$ and
$\alpha_i = \alpha|A_i$ for every $i \in I$.

The final theorem says that traces of component automata can also be pasted together to form 
a trace of the composition. 

Theorem 3.2.4 Let $\{A_i\}_{i \in I}$ be a compatible collection of automata and let $A = \prod_{i \in I} A_i$. Suppose
$\beta$ is a sequence of actions in $ext(A)$. If $\beta|A_i \in traces(A_i)$ for every $i \in I$, then $\beta \in traces(A)$.

Theorem 3.2.4 implies that in order to show that a sequence is a trace of a system, it is enough 
to show that its projection on each individual system component is a trace of that component. 
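
The projection used in these theorems is easy to state operationally. The following Python sketch computes the projection of a sequence of actions onto the action set of one component. The function name `project` and the encoding of actions as strings are assumptions made for illustration.

```python
def project(beta, actions):
    """Return the subsequence of beta consisting of the actions in `actions`."""
    return [a for a in beta if a in actions]


# Hypothetical two-component system: the sender owns "send",
# the channel owns both "send" and "recv".
beta = ["send", "recv", "send", "recv"]
sender_view = project(beta, {"send"})
channel_view = project(beta, {"send", "recv"})
```

In the spirit of Theorem 3.2.4, checking that `beta` is a trace of the composition then reduces to checking `sender_view` and `channel_view` against their own components.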






Chapter 4 



The vs service 



In this chapter we describe the view-oriented group communication service vs introduced in [41]. The 
vs service deals with views. We describe "variations" of the service, which deal with configurations. 

4.1 The VS service 

The VS service is a view-oriented group communication service. The name VS stands for "view 
synchrony" and refers to the property that a message sent within a particular view is delivered 
only to members of that view and only if they are still in that view. This seems to be the most 
important property of group communication services that go under the label of "virtual synchrony" 
(the expression "virtual synchrony" has been actually semantically overloaded and several virtual 
synchronous services provide different guarantees). 

Another important feature of the VS specification is that it requires that the sequences of messages
received by two different processes within a given view are such that one is a prefix of the other.
Finally, new views are reported to their members in order of view identifier.

The external actions of the VS specification include vs-gpsnd($m$)$_p$ actions, representing the client
at $p$ sending a message $m$, and vs-gprcv($m$)$_{p,q}$ actions, representing the delivery to $q$ of the message
$m$ sent by $p$. Output actions vs-safe($m$)$_{p,q}$ are also provided at $q$ to report that the earlier message
$m$ from $p$ has been delivered to all locations in the current view as known by $q$.

The VS service informs its clients of group status changes through vs-newview($\langle g, P \rangle$)$_p$ actions
(with $p \in P$), which tell $p$ that the view identifier $g$ is associated with membership set $P$ and
that, until another vs-newview occurs, the following messages will be in this view. After any finite
execution, the current view at $p$ is defined as the argument $v$ in the last vs-newview($v$)$_p$ event, if any;
otherwise it is either the initial view $\langle g_0, P_0 \rangle$ if $p \in P_0$, or $\bot$ if $p \notin P_0$. This reflects the concept that
the system starts with the processors in $P_0$ forming the group, and other processors unaware of the
group.
VS

Signature:

  Input:    vs-gpsnd(m)_p, m ∈ ℳ, p ∈ 𝒫
  Internal: vs-createview(v), v ∈ 𝒱
            vs-order(m,p,g), m ∈ ℳ, p ∈ 𝒫, g ∈ 𝒢
  Output:   vs-gprcv(m)_{p,q}, m ∈ ℳ, p,q ∈ 𝒫
            vs-safe(m)_{p,q}, m ∈ ℳ, p,q ∈ 𝒫
            vs-newview(v)_p, v ∈ 𝒱, p ∈ v.set

State:

  created ∈ 2^𝒱, init {v_0}
  for each p ∈ 𝒫:
    current-viewid[p] ∈ 𝒢_⊥, init g_0 if p ∈ P_0, ⊥ else
  for each g ∈ 𝒢:
    queue[g] ∈ seqof(ℳ × 𝒫), init λ
  for each p ∈ 𝒫, g ∈ 𝒢:
    pending[p,g] ∈ seqof(ℳ), init λ
    next[p,g] ∈ ℕ^{>0}, init 1
    next-safe[p,g] ∈ ℕ^{>0}, init 1

Transitions:

  internal vs-createview(v)
    Pre: ∀w ∈ created : v.id > w.id
    Eff: created := created ∪ {v}

  output vs-newview(v)_p
    Pre: v ∈ created
         v.id > current-viewid[p]
    Eff: current-viewid[p] := v.id

  input vs-gpsnd(m)_p
    Eff: if current-viewid[p] ≠ ⊥ then
           append m to pending[p, current-viewid[p]]

  internal vs-order(m,p,g)
    Pre: m is head of pending[p,g]
    Eff: remove head of pending[p,g]
         append ⟨m,p⟩ to queue[g]

  output vs-gprcv(m)_{p,q}, choose g
    Pre: g ≠ ⊥
         g = current-viewid[q]
         queue[g](next[q,g]) = ⟨m,p⟩
    Eff: next[q,g] := next[q,g] + 1

  output vs-safe(m)_{p,q}, choose g, P
    Pre: g ≠ ⊥
         g = current-viewid[q]
         ⟨g,P⟩ ∈ created
         queue[g](next-safe[q,g]) = ⟨m,p⟩
         for all r ∈ P:
           next[r,g] > next-safe[q,g]
    Eff: next-safe[q,g] := next-safe[q,g] + 1

Figure 4-1: The VS service
The code is given in Figure 4-1.

The state of the VS service keeps track of the created views in variable created, and for each
processor $p$, it keeps track of the identifier of the current view at $p$, in variable current-viewid[$p$]. For each view,
incoming messages from a client at $p$ are first buffered into a queue for processor $p$, pending[$p$, $g$],
and then they are "ordered" into a global queue for the view, queue[$g$]. The pointers next[$p$, $g$] and
next-safe[$p$, $g$] point to, respectively, the next message of the global queue that has to be delivered
to the client at $p$ and the next safe indication that has to be delivered to the client at $p$. In any
trace of the VS service, there is a natural correspondence between vs-gprcv events and the vs-gpsnd
events that cause them, and between vs-safe events and the vs-gpsnd events that cause them.
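
To make the role of these state variables concrete, the following Python sketch simulates the VS state and its transitions. It is only an illustration under stated assumptions, not the specification itself: views are encoded as pairs `(id, frozenset_of_members)`, view identifiers are integers, `None` stands for $\bot$ and is treated as smaller than every identifier, and each method returns a value indicating whether the corresponding action was enabled. The class and method names are hypothetical.

```python
from collections import defaultdict

BOTTOM = None  # stands for the undefined view identifier


class VS:
    def __init__(self, g0, P0, processors):
        self.created = {(g0, frozenset(P0))}
        self.current_viewid = {p: (g0 if p in P0 else BOTTOM) for p in processors}
        self.queue = defaultdict(list)           # queue[g]: list of (m, p)
        self.pending = defaultdict(list)         # pending[(p, g)]: list of m
        self.next = defaultdict(lambda: 1)       # next[(p, g)], 1-based index
        self.next_safe = defaultdict(lambda: 1)  # next_safe[(p, g)], 1-based index

    def createview(self, g, P):
        # Pre: the new identifier exceeds every created identifier.
        if all(g > w_id for (w_id, _) in self.created):
            self.created.add((g, frozenset(P)))
            return True
        return False

    def newview(self, g, P, p):
        v = (g, frozenset(P))
        cur = self.current_viewid[p]
        # The signature restricts the action to members: p must be in v.set.
        if v in self.created and p in v[1] and (cur is BOTTOM or g > cur):
            self.current_viewid[p] = g
            return True
        return False

    def gpsnd(self, m, p):
        g = self.current_viewid[p]
        if g is not BOTTOM:  # messages sent with no current view are ignored
            self.pending[(p, g)].append(m)

    def order(self, p, g):
        if self.pending[(p, g)]:
            m = self.pending[(p, g)].pop(0)
            self.queue[g].append((m, p))
            return True
        return False

    def gprcv(self, q):
        """Deliver to q the next message of q's current view, if any."""
        g = self.current_viewid[q]
        if g is BOTTOM:
            return None
        i = self.next[(q, g)]
        if i <= len(self.queue[g]):
            self.next[(q, g)] = i + 1
            return self.queue[g][i - 1]
        return None

    def safe(self, q):
        """Report to q the next message already delivered to all members."""
        g = self.current_viewid[q]
        if g is BOTTOM:
            return None
        members = next(P for (vid, P) in self.created if vid == g)
        i = self.next_safe[(q, g)]
        if i <= len(self.queue[g]) and all(self.next[(r, g)] > i for r in members):
            self.next_safe[(q, g)] = i + 1
            return self.queue[g][i - 1]
        return None
```

In this sketch a safe indication for a message becomes available at `q` only after every member of the view has advanced its `next` pointer past that message, which mirrors the precondition of vs-safe.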

The actions for creating a view and for informing a processor of a new view are straightforward 
(recall that the signature ensures that only members, but not necessarily all members, receive 
notification of a new view). 

A message that is sent before the sender knows of any view (when the current view is $\bot$) is
simply ignored, and never delivered anywhere.

Note that the VS specification does not include any restrictions on when a new view might be formed.
Clearly it is possible to analyze the service conditioned on some restrictions on the execution. For
example, the performance and fault-tolerance analysis provided in [41] does consider some
restrictions: it implies that "capricious" view changes must stop shortly after the behavior of the
underlying physical system stabilizes.

We note that the fact that vs allows views to be created only in order of view identifier is 
unimportant: weakening this requirement to allow out-of-order view creation would not change the 
external behavior, because vs-newview actions are constrained to occur in such a way that views are 
delivered in order of view identifiers anyway.

The following are safety properties of the vs service which we will be using in Chapter 5. 

• New views are reported in increasing order of view identifier (Monotone views property) ; 

• Messages sent in a view are delivered only within that view (View synchrony property) ; 

• The sequences of messages delivered in a view at any two processors are such that one sequence 
is a prefix of the other (Prefix order property) . 
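
The prefix order property can be checked mechanically on a recorded run. The following Python sketch assumes a hypothetical log format in which, for a fixed view, `delivered[p]` is the list of messages delivered to processor `p` in that view. The function names are illustrative only.

```python
def is_prefix(a, b):
    """True if sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and list(b[:len(a)]) == list(a)


def prefix_order_holds(delivered):
    """Check that any two delivery sequences are related by the prefix order."""
    seqs = list(delivered.values())
    return all(is_prefix(x, y) or is_prefix(y, x) for x in seqs for y in seqs)
```

A processor that has received fewer messages than another is therefore acceptable, as long as the messages it did receive agree position by position with those of the other processor.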

The following invariant holds. 

Invariant 4.1.1 (vs) 

In any reachable state, if $v, v' \in created$ and $v.id = v'.id$, then $v = v'$.
4.2 Variations: The CS and LC services

In many applications involving shared data, updates to the shared data have to be agreed upon by 
all the members of a view. In order to improve availability of the service and to balance the load of 
the system it is desirable to make updates without involving all the members of a view while still 
preserving consistency. This is achieved by using configurations. 

A configuration is different from an ordinary view in that it is an ordinary view equipped with a 
set of subsets of the members of the view which satisfy the property that any two sets intersect. Such 
sets are called quorums. Hence a configuration is a view plus a set of quorums. The intersection 
property of quorums guarantees consistency within a given configuration: indeed for any given 
quorum there is always at least one process that has the latest update. 
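
The intersection property that defines quorums can be checked directly. The following Python sketch is an illustration under assumptions: members are arbitrary hashable values, the helper names `is_quorum_system` and `majorities` are hypothetical, and only pairs of distinct quorums are compared.

```python
from itertools import combinations


def is_quorum_system(members, quorums):
    """Every quorum is a subset of members and any two quorums intersect."""
    universe = frozenset(members)
    qs = [frozenset(q) for q in quorums]
    within = all(q <= universe for q in qs)
    intersecting = all(a & b for a, b in combinations(qs, 2))
    return within and intersecting


def majorities(members):
    """All subsets containing more than half of the members."""
    universe = sorted(members)
    n = len(universe)
    return [frozenset(c)
            for k in range(n // 2 + 1, n + 1)
            for c in combinations(universe, k)]
```

Majorities are one familiar way to obtain such a set of quorums: two subsets that each contain more than half of the members must share at least one member.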

We will consider two more specialized types of configurations, which have been introduced in 
Chapter 3. 

One such configuration is the read-write-quorum configuration. Recalling the definition from
Chapter 3, we have that a configuration is $c = \langle g, P, \mathcal{R}, \mathcal{W} \rangle$, where $g$ is a configuration identifier, $P$
is the set of members, and $\mathcal{R}$ and $\mathcal{W}$ are the sets of read and write quorums.

Another such configuration is the leader configuration. Recalling the definition from Chapter 3,
we have that, in this case, a configuration is $c = \langle g, P, \mathcal{Q}, p \rangle$, where $g$ is a configuration identifier, $P$
is the set of members, $\mathcal{Q}$ is the set of quorums, and $p$ is the leader of the configuration.

The VS service supports ordinary views $v = \langle g, P \rangle$ but can be easily generalized to configurations.
We call these generalizations CS and LC, respectively, for the read-write-quorums configurations and 
for the leader configurations. The only difference between CS and vs, as well as LC and vs, is that 
CS and LC announce configurations while VS announces ordinary views. The code of CS, as well as 
that of LC, is exactly the code of vs. Indeed configurations are treated as a single entity, as are 
ordinary views. The reason we "rename" the code is because the two services are actually different 
(one provides views and the other provides configurations), thus we must distinguish them. 

Figure 4-2 shows the code of the CS and LC specifications. Since this code is the same as that of VS, all
the properties and invariants of VS apply to CS and LC too. In particular we have that configurations
are reported in increasing order of configuration identifier (Monotone configurations property),
messages sent in a configuration are delivered only within that configuration (Configuration synchrony
property), and the sequences of messages delivered in a configuration at any two processes are such
that one is a prefix of the other.
CS and LC

Signature:

  Input:    cs-gpsnd(m)_p, m ∈ ℳ, p ∈ 𝒫
  Internal: cs-createconf(c), c ∈ 𝒞
            cs-order(m,p,g), m ∈ ℳ, p ∈ 𝒫, g ∈ 𝒢
  Output:   cs-gprcv(m)_{p,q}, m ∈ ℳ, p,q ∈ 𝒫
            cs-safe(m)_{p,q}, m ∈ ℳ, p,q ∈ 𝒫
            cs-newconf(c)_p, c ∈ 𝒞, p ∈ c.set

State:

  created ∈ 2^𝒞, init {c_0}
  for each p ∈ 𝒫:
    current-confid[p] ∈ 𝒢_⊥, init g_0 if p ∈ P_0, ⊥ else
  for each g ∈ 𝒢:
    queue[g] ∈ seqof(ℳ × 𝒫), init λ
  for each p ∈ 𝒫, g ∈ 𝒢:
    pending[p,g] ∈ seqof(ℳ), init λ
    next[p,g] ∈ ℕ^{>0}, init 1
    next-safe[p,g] ∈ ℕ^{>0}, init 1

Transitions:

  internal cs-createconf(c)
    Pre: ∀w ∈ created : c.id > w.id
    Eff: created := created ∪ {c}

  output cs-newconf(c)_p
    Pre: c ∈ created
         c.id > current-confid[p]
    Eff: current-confid[p] := c.id

  input cs-gpsnd(m)_p
    Eff: if current-confid[p] ≠ ⊥ then
           append m to pending[p, current-confid[p]]

  internal cs-order(m,p,g)
    Pre: m is head of pending[p,g]
    Eff: remove head of pending[p,g]
         append ⟨m,p⟩ to queue[g]

  output cs-gprcv(m)_{p,q}, choose g
    Pre: g = current-confid[q]
         queue[g](next[q,g]) = ⟨m,p⟩
    Eff: next[q,g] := next[q,g] + 1

  output cs-safe(m)_{p,q}, choose g, P
    Pre: g = current-confid[q]
         ⟨g,P⟩ ∈ created
         queue[g](next-safe[q,g]) = ⟨m,p⟩
         for all r ∈ P:
           next[r,g] > next-safe[q,g]
    Eff: next-safe[q,g] := next-safe[q,g] + 1

Figure 4-2: The CS service
Chapter 5



The dvs service 



In this chapter we present DVS, a specification for a dynamic primary view group communication 
service. Section 5.1 provides the dvs specification, Section 5.2 provides an implementation of dvs 
and finally Section 5.3 describes an application that uses dvs as a building block. Section 5.4 closes
the chapter with some remarks. 

5.1 The dvs specification 

The dvs service works as follows. Each client of the service has a "current" view of the group of 
processes. A process can send a message to all other members of its current view and the service 
guarantees that messages sent within a view are delivered only within that view and each member of 
the view receives messages in the same order as other members. However, not all messages need to 
be delivered to all members. The service also provides a "safe" notification for a particular message 
m that tells the recipient that message m has been received by all the members of the current view. 
New views are announced to all members of the new view and new views are guaranteed to be 
"primary" views. Primary views are defined according to a dynamic notion [55]: a new primary 
needs to contain a majority of the members of the previous primary. The DVS service allows the 
clients to "register" a new view after completing the pre-processing for that view. 

The specification is given in Figure 5-1. In this specification, $\mathcal{M}_c \subseteq \mathcal{M}$ denotes the set of messages
that clients may use for communication. The most interesting part of the DVS specification is the
transition definition for dvs-createview($v$). The precondition specifies the properties that a view
must satisfy in order to be considered primary. The precondition says that $v.set$ must intersect the
membership set of all previously-created smaller-id views $w$ for which there is no intervening totally
registered view - that is, the set of all "possible previous primary views". Since (for convenience) we
allow out-of-order view creation in DVS, we also include a symmetric condition for previously-created
larger-id views. All created views are recorded in created.
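
The precondition just described can be phrased as a small predicate. The Python sketch below is an illustration only: views are encoded as pairs `(id, frozenset_of_members)` with integer identifiers, `tot_reg` is the set of totally registered views, and the function name `can_create` is hypothetical.

```python
def can_create(v, created, tot_reg):
    """Return True if view v satisfies the dvs-createview precondition."""
    v_id, v_set = v
    # The identifier of v must be new.
    if any(w_id == v_id for (w_id, _) in created):
        return False
    for (w_id, w_set) in created:
        lo, hi = min(v_id, w_id), max(v_id, w_id)
        # Covers both the smaller-id and the symmetric larger-id case:
        # some totally registered view lies strictly between v and w.
        separated = any(lo < x_id < hi for (x_id, _) in tot_reg)
        if not separated and not (v_set & w_set):
            return False
    return True
```

For example, with only the initial view created and totally registered, a candidate view that shares no member with it is rejected, while any candidate sharing at least one member is accepted.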
DVS

Signature:

  Input:    dvs-gpsnd(m)_p, m ∈ ℳ_c, p ∈ 𝒫
            dvs-register_p, p ∈ 𝒫
  Internal: dvs-createview(v), v ∈ 𝒱
            dvs-order(m,p,g), m ∈ ℳ_c, p ∈ 𝒫, g ∈ 𝒢
  Output:   dvs-gprcv(m)_{p,q}, m ∈ ℳ_c, p,q ∈ 𝒫
            dvs-safe(m)_{p,q}, m ∈ ℳ_c, p,q ∈ 𝒫
            dvs-newview(v)_p, v ∈ 𝒱, p ∈ v.set

State:

  created ∈ 2^𝒱, init {v_0}
  for each p ∈ 𝒫:
    current-viewid[p] ∈ 𝒢_⊥, init g_0 if p ∈ P_0, ⊥ else
  for each g ∈ 𝒢:
    queue[g] ∈ seqof(ℳ_c × 𝒫), init λ
    attempted[g] ∈ 2^𝒫, init P_0 if g = g_0, ∅ else
    registered[g] ∈ 2^𝒫, init P_0 if g = g_0, ∅ else
  for each p ∈ 𝒫, g ∈ 𝒢:
    pending[p,g] ∈ seqof(ℳ_c), init λ
    next[p,g] ∈ ℕ^{>0}, init 1
    next-safe[p,g] ∈ ℕ^{>0}, init 1

Derived variables:

  Att ∈ 2^𝒱, defined as {v ∈ created | attempted[v.id] ≠ ∅}
  TotAtt ∈ 2^𝒱, defined as {v ∈ created | v.set ⊆ attempted[v.id]}
  Reg ∈ 2^𝒱, defined as {v ∈ created | registered[v.id] ≠ ∅}
  TotReg ∈ 2^𝒱, defined as {v ∈ created | v.set ⊆ registered[v.id]}

Transitions:

  internal dvs-createview(v)
    Pre: ∀w ∈ created : v.id ≠ w.id
         ∀w ∈ created :
           ∃x ∈ TotReg : w.id < x.id < v.id
           or ∃x ∈ TotReg : v.id < x.id < w.id
           or v.set ∩ w.set ≠ ∅
    Eff: created := created ∪ {v}

  output dvs-newview(v)_p
    Pre: v ∈ created
         v.id > current-viewid[p]
    Eff: current-viewid[p] := v.id
         attempted[v.id] := attempted[v.id] ∪ {p}

  input dvs-register_p
    Eff: if current-viewid[p] ≠ ⊥ then
           registered[current-viewid[p]] :=
             registered[current-viewid[p]] ∪ {p}

  input dvs-gpsnd(m)_p
    Eff: if current-viewid[p] ≠ ⊥ then
           append m to pending[p, current-viewid[p]]

  internal dvs-order(m,p,g)
    Pre: m is head of pending[p,g]
    Eff: remove head of pending[p,g]
         append ⟨m,p⟩ to queue[g]

  output dvs-gprcv(m)_{p,q}, choose g
    Pre: g = current-viewid[q]
         queue[g](next[q,g]) = ⟨m,p⟩
    Eff: next[q,g] := next[q,g] + 1

  output dvs-safe(m)_{p,q}, choose g, P
    Pre: g = current-viewid[q]
         ⟨g,P⟩ ∈ created
         queue[g](next-safe[q,g]) = ⟨m,p⟩
         for all r ∈ P:
           next[r,g] > next-safe[q,g]
    Eff: next-safe[q,g] := next-safe[q,g] + 1

Figure 5-1: The DVS service.
The DVS service informs its clients of view changes using dvs-newview($\langle g, P \rangle$)$_p$ actions; such an
action informs processor $p$ that the view identifier $g$ is associated with membership set $P$ and that
the current group of processors connected to $p$ is $P$. After any finite execution, we define the current
view at $p$ to be the argument $v$ in the last dvs-newview($v$)$_p$ event, if any; otherwise it is the initial view
$v_0$ for processors in $P_0$ and is undefined for other processors. Even though views can be created out
of view identifier order, the notification to each client is consistent with that order. Not every client
needs to see every view. The variable attempted records, for each view, which processes have been
notified of that view. Variable attempted is only used in proving the correctness of an implementation
of DVS.

With the dvs-register$_p$ action, the client at $p$ informs the service that it has obtained whatever
information the application needs to begin operating in the new view $v$. For many applications, this
will mean that $p$ has received messages from every other member of view $v$, reporting its state at the
start of $v$. The variable registered records, for each view, which processes have registered that view.
Variable registered is only used in proving the correctness of an implementation of DVS.

The DVS service allows a processor $p$ to broadcast a message $m$ using a dvs-gpsnd($m$)$_p$ action, and
delivers the message to a processor $q$ using a dvs-gprcv($m$)$_{p,q}$ action. DVS also uses a dvs-safe($m$)$_{p,q}$
action to report to processor $q$ that the earlier message $m$ from $p$ has been delivered to all members
of the current view of $q$. DVS guarantees that messages sent by a processor $p$ when the current view
of $p$ is $v$ are delivered only within view $v$ (i.e., only to processors in $v.set$ whose current view is $v$).
Moreover, each processor receives messages in the same order as other processors and without gaps
in the sequence of received messages; however, a processor may receive only a prefix of the sequence
of messages received by another processor. Variables queue, pending, next and next-safe are used for
handling the messages. Their use should be clear from the code.

There are four derived variables, $Att$, $TotAtt$, $Reg$ and $TotReg$. Informally, a view belongs to
the set $Att$ if it has been reported to at least one member of the view (we say that it is attempted).
A view belongs to the set $TotAtt$ if it has been reported to all members of the view (we say that
the view is totally attempted). Similarly, a view belongs to the set $Reg$ if at least one member of
the view has registered the view (we say that it is registered) and belongs to the set $TotReg$ if all
members of the view have registered the view (we say that the view is totally registered).
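
The four derived variables are plain set comprehensions over the state. The following Python sketch computes them under an assumed encoding: views are pairs `(id, frozenset_of_members)`, and `attempted` and `registered` are dictionaries from view identifiers to sets of processors. The function name `derived_sets` is hypothetical.

```python
def derived_sets(created, attempted, registered):
    """Compute Att, TotAtt, Reg and TotReg from the state variables."""
    def some(table, v):
        # At least one processor is recorded for the view.
        return bool(table.get(v[0], set()))

    def all_members(table, v):
        # Every member of the view is recorded for the view.
        return v[1] <= table.get(v[0], set())

    return {
        "Att": {v for v in created if some(attempted, v)},
        "TotAtt": {v for v in created if all_members(attempted, v)},
        "Reg": {v for v in created if some(registered, v)},
        "TotReg": {v for v in created if all_members(registered, v)},
    }
```

On states reachable in the specification, where a processor can register a view only after having been notified of it, the computed sets would be expected to satisfy the inclusions stated in the invariants of this section.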

We close this section with some invariants stating properties of dvs. 

Invariant 5.1.1 (dvs) 

In any reachable state, $TotAtt \subseteq Att$, $TotReg \subseteq Reg$, $Reg \subseteq Att$, and $TotReg \subseteq TotAtt$.

Invariant 5.1.2 (dvs) 

In any reachable state, if $p \in attempted[g]$ then $current\text{-}viewid[p] \geq g$.
Invariant 5.1.3 expresses the key intersection property guaranteed by DVS; this is weaker than
the intersection property required by static definitions of primary views, which says that all primary 
components must intersect. This invariant is our version of the correctness requirement for dynamic 
view services that two consecutive primary views intersect. 

Invariant 5.1.3 (dvs) 

In any reachable state, if $v, w \in created$, $v.id < w.id$, and there is no $x \in TotReg$ such that $v.id <
x.id < w.id$, then $v.set \cap w.set \neq \emptyset$.

Proof: By induction on the length of the execution. The base case consists of proving that the 
invariant is true in the initial state. In the initial state created = {vo} and thus the invariant is 
vacuously true. 

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$
for any possible step $(s, \pi, s')$. The only steps that can change the hypothesis from false to true are
dvs-createview($v$) and dvs-createview($w$). The preconditions of these actions show that the needed
conclusion holds. No step changes the conclusion from true to false. □

Invariant 5.1.4 says that if a view w is totally attempted, then any earlier view v has a member 
whose current view is later than v. 

Invariant 5.1.4 (dvs) 

In any reachable state, if $v \in created$, $w \in TotAtt$, and $v.id < w.id$, then there exists $p \in v.set$ with
$current\text{-}viewid[p] > v.id$.

Proof: Consider any particular reachable state. Assume that $v \in created$, $w \in TotAtt$, and $v.id <
w.id$. Then let $y$ be the view in $TotAtt$ having the smallest view identifier strictly greater than $v.id$. Then
there is no $x \in TotAtt$ with $v.id < x.id < y.id$. Then Invariant 5.1.1 implies that there is no
$x \in TotReg$ with $v.id < x.id < y.id$. Then Invariant 5.1.3 implies that $v.set \cap y.set \neq \emptyset$. Let
$p \in v.set \cap y.set$; then $p \in attempted[y.id]$. Then Invariant 5.1.2 implies that $current\text{-}viewid[p] \geq y.id$.
This implies $current\text{-}viewid[p] > v.id$. □



5.2 An implementation of dvs 

We now give an implementation of the dvs service which we call dvs-impl. In Section 5.2.1 we de- 
scribe DVS-IMPL, in Section 5.2.2 we provide some invariants of DVS-IMPL and finally in Section 5.2.3 
we prove that dvs-impl implements dvs, in the sense of inclusion of sets of traces. 
5.2.1 Overview

The implementation uses as a building block the group communication service VS (see Chapter 4)
and uses ideas from [88]. The overall system is the composition of an automaton vs-to-dvs$_p$ for
each $p \in \mathcal{P}$, and VS, with the external actions of VS hidden in the composition. This system is called
DVS-IMPL and is illustrated in Figure 5-2.




Figure 5-2: The DVS-IMPL system. 

The automaton vs-to-dvs$_p$ is given in Figure 5-3. vs-to-dvs$_p$ uses special non-client messages,
tagged either with "info" or "registered". Thus, we use $\mathcal{M} = \mathcal{M}_c \cup (\{\text{"info"}\} \times \mathcal{V} \times 2^{\mathcal{V}}) \cup \{\text{"registered"}\}$,
where $\mathcal{M}_c$ is the set of all client messages and $\mathcal{M}$ is the universe of all messages. The attempted, reg,
and info-sent state variables are not needed for the algorithm, but only for the correctness proof.

Automaton vs-to-dvs$_p$ acts as a "filter", receiving vs-newview inputs from the underlying VS
service and deciding whether to accept the proposed views as primary views. If vs-to-dvs$_p$ decides
to accept some such view $v$, it "attempts" the view by performing a dvs-newview($v$) output. For
each $v$, we think of the DVS internal dvs-createview($v$) action as occurring at the time of the first
dvs-newview($v$) event.

According to the DVS specification, the algorithm is supposed to guarantee nonempty intersection
of each newly-created primary view $v$ with any previously-created view $w$ having no intervening
totally registered view - a global condition involving nonempty intersection. The vs-to-dvs$_p$
processors, however, do not have accurate knowledge of which primary views have been created by other
processors, nor of which views are totally registered. Therefore, the processors employ a local check
of majority intersection with known views, rather than a global check of nonempty intersection with
existing views. Specifically, each vs-to-dvs$_p$ keeps track of an "active" view $act$, which is the latest
view that it knows to be totally registered, plus a set of "ambiguous" views $amb$, which are all the
views that it knows have been attempted (i.e., have had a dvs-newview action performed someplace),
and whose ids are greater than $act.id$. We define $use = \{act\} \cup amb$. When vs-to-dvs$_p$ receives a
vs-newview($v$) input, it sends out "info" messages containing its current $act$ and $amb$ values to all the
other processors in the new view, using the VS service, and then waits to receive corresponding "info"
messages for view $v$ from all the other processors in the view. After receiving this information (and
VS-TO-DVS_p

Signature:

  Input:    dvs-gpsnd(m)_p, m ∈ ℳ_c
            dvs-register_p
            vs-newview(v)_p, v ∈ 𝒱, p ∈ v.set
            vs-gprcv(m)_{q,p}, m ∈ ℳ, q ∈ 𝒫
            vs-safe(m)_{q,p}, m ∈ ℳ, q ∈ 𝒫
  Internal: dvs-garbage-collect(v)_p, v ∈ 𝒱
  Output:   vs-gpsnd(m)_p, m ∈ ℳ
            dvs-newview(v)_p, v ∈ 𝒱, p ∈ v.set
            dvs-gprcv(m)_{q,p}, m ∈ ℳ_c, q ∈ 𝒫
            dvs-safe(m)_{q,p}, m ∈ ℳ_c, q ∈ 𝒫

State:

  cur ∈ 𝒱_⊥, init v_0 if p ∈ P_0, ⊥ else
  client-cur ∈ 𝒱_⊥, init v_0 if p ∈ P_0, ⊥ else
  act ∈ 𝒱, init v_0
  amb ∈ 2^𝒱, init ∅
  attempted ∈ 2^𝒱, init {v_0} if p ∈ P_0, ∅ else
  for each g ∈ 𝒢, q ∈ 𝒫:
    info-rcvd[q,g] ∈ (𝒱 × 2^𝒱)_⊥, init ⊥
    rcvd-rgst[q,g], a bool, init false
  for each g ∈ 𝒢:
    msgs-to-vs[g] ∈ seqof(ℳ), init λ
    msgs-from-vs[g] ∈ seqof(ℳ_c × 𝒫), init λ
    safe-from-vs[g] ∈ seqof(ℳ_c × 𝒫), init λ
    reg[g], a bool, init true if p ∈ P_0 and g = g_0, false else
    info-sent[g] ∈ (𝒱 × 2^𝒱)_⊥, init ⊥

Derived variables:

  Att ∈ 2^𝒱, defined as Att = {v ∈ created | (∃p ∈ v.set) v ∈ attempted_p}
  Reg ∈ 2^𝒱, defined as Reg = {v ∈ created | (∃p ∈ v.set) reg[v.id]_p = true}
  TotAtt ∈ 2^𝒱, defined as TotAtt = {v ∈ created | (∀p ∈ v.set) v ∈ attempted_p}
  TotReg ∈ 2^𝒱, defined as TotReg = {v ∈ created | (∀p ∈ v.set) reg[v.id]_p = true}
  use ∈ 2^𝒱, defined as use = {act} ∪ amb

Transitions:

  input vs-newview(v)_p
    Eff: cur := v
         append ⟨"info", act, amb⟩ to msgs-to-vs[cur.id]
         info-sent[cur.id] := ⟨act, amb⟩

  input vs-gprcv(⟨"info", v, V⟩)_{q,p}
    Eff: info-rcvd[q, cur.id] := ⟨v, V⟩
         if v.id > act.id then act := v
         amb := {w ∈ amb ∪ V | w.id > act.id}

  input vs-safe(⟨"info", v, V⟩)_{q,p}
    Eff: none

  output dvs-newview(v)_p
    Pre: v = cur
         v.id > client-cur.id
         ∀q ∈ v.set, q ≠ p : info-rcvd[q, v.id] ≠ ⊥
         ∀w ∈ use : |v.set ∩ w.set| > |w.set|/2
    Eff: amb := amb ∪ {v}
         attempted := attempted ∪ {v}
         client-cur := v

  input dvs-register_p
    Eff: if client-cur ≠ ⊥ then
           reg[client-cur.id] := true
           append ⟨"registered"⟩ to msgs-to-vs[client-cur.id]

  input vs-gprcv(⟨"registered"⟩)_{q,p}
    Eff: rcvd-rgst[q, cur.id] := true

  input vs-safe(⟨"registered"⟩)_{q,p}
    Eff: none

  internal dvs-garbage-collect(v)_p
    Pre: ∀q ∈ v.set : rcvd-rgst[q, v.id] = true
         v.id > act.id
    Eff: act := v
         amb := {w ∈ amb | w.id > act.id}

  input dvs-gpsnd(m)_p
    Eff: if client-cur ≠ ⊥ then
           append m to msgs-to-vs[client-cur.id]

  output vs-gpsnd(m)_p
    Pre: m is head of msgs-to-vs[cur.id]
    Eff: remove head of msgs-to-vs[cur.id]

  input vs-gprcv(m)_{q,p}, where m ∈ ℳ_c
    Eff: append ⟨m,q⟩ to msgs-from-vs[cur.id]

  output dvs-gprcv(m)_{q,p}
    Pre: ⟨m,q⟩ is head of msgs-from-vs[client-cur.id]
    Eff: remove head of msgs-from-vs[client-cur.id]

  input vs-safe(m)_{q,p}, where m ∈ ℳ_c
    Eff: append ⟨m,q⟩ to safe-from-vs[cur.id]

  output dvs-safe(m)_{q,p}
    Pre: ⟨m,q⟩ is head of safe-from-vs[client-cur.id]
    Eff: remove head of safe-from-vs[client-cur.id]

Figure 5-3: The vs-to-dvs_p code.
updating its own $act$ and $amb$ accordingly), vs-to-dvs$_p$ checks that $v$ has a majority intersection
with each view in $use$. If so, vs-to-dvs$_p$ performs a dvs-newview$_p$ output.
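
The local bookkeeping and the majority check can be sketched as two small Python functions. This is an illustration under assumptions, not the automaton itself: views are pairs `(id, frozenset_of_members)` with integer identifiers, and the names `merge_info` and `accepts` are hypothetical. The first function mirrors the effect of receiving an "info" message carrying a view `v` and a set of views `V`; the second mirrors the majority-intersection precondition of dvs-newview.

```python
def merge_info(act, amb, v, V):
    """Update act and amb after receiving an "info" message (v, V)."""
    if v[0] > act[0]:
        act = v
    # Keep only the known attempted views that are newer than act.
    amb = {w for w in (set(amb) | set(V)) if w[0] > act[0]}
    return act, amb


def accepts(v, act, amb):
    """True if v.set contains a strict majority of every view in use."""
    use = {act} | set(amb)
    return all(2 * len(v[1] & w[1]) > len(w[1]) for w in use)
```

The comparison `2 * len(...) > len(...)` avoids fractional arithmetic while expressing the same strict-majority test as $|v.set \cap w.set| > |w.set|/2$.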

Then the clients can use the communication system to exchange state information as needed for
processing in view $v$. When the client at $p$ has obtained enough information, it "registers" the view by
means of action dvs-register$_p$, which causes processor $p$ to send "registered" messages to the other
members. When a processor receives "registered" messages for a view $v$ from all members, it may
perform garbage collection by discarding information about views with ids smaller than that of $v$.
vs-to-dvs$_p$ uses VS to send and receive messages.

The system DVS-IMPL is defined as the composition of all the vs-to-dvs$_p$ automata and VS with all
the external actions of VS hidden.

There are four derived variables for DVS-IMPL analogous to those of DVS, indicating the
attempted, totally attempted, registered, and totally registered views, respectively. Another derived
variable, $use_p$, is defined in the code.

5.2.2 Invariants of dvs-impl 

This section contains invariants of dvs-impl needed for the proof that dvs-impl implements dvs in 
Section 5.2.3. The first invariants state simple facts about dvs. 

Invariant 5.2.1 (dvs-impl) 

In any reachable state, if $cur_p \neq \bot$ then $current\text{-}viewid[p] = cur.id_p$.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix $p$. In the initial state we have that $cur_p = \bot$.

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$
for any possible step $(s, \pi, s')$. Fix $p$. We prove the invariant considering each possible action $\pi$.

1. $\pi$ = vs-newview($v$)$_p$.

By the code of $\pi$ in VS, we have that $current\text{-}viewid[p] = v.id$. By the code of $\pi$ in DVS-IMPL,
we have that $cur.id_p = v.id$.

2. Other actions.

Variables $current\text{-}viewid[p]$ and $cur.id_p$ are not modified. Hence the assertion cannot be made
false.

□

Invariant 5.2.2 (dvs-impl) 

In any reachable state, if $v \in attempted_p$ then $client\text{-}cur.id_p \geq v.id$.
Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix $v, p$. In the initial state we have that $attempted_p = \{v_0\}$ for
$p \in P_0$ and $attempted_p = \emptyset$ for $p \notin P_0$. So assume that $v = v_0$ and $p \in P_0$. Then $client\text{-}cur_p = v_0$.
Hence the invariant is true.

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$ for
any possible step $(s, \pi, s')$. Fix $v, p$ and assume that $v \in s'.attempted_p$. We distinguish two possible
cases.

1. $v \in s.attempted_p$.

By the inductive hypothesis we have that $s.client\text{-}cur.id_p \geq v.id$. By the monotonicity of
$client\text{-}cur.id_p$ we have that $s'.client\text{-}cur.id_p \geq s.client\text{-}cur.id_p$.

2. $v \notin s.attempted_p$.

Then it must be $\pi$ = dvs-newview($v$)$_p$. The invariant follows from the code, which sets $client\text{-}cur_p$
to $v$.

□

Invariant 5.2.3 (dvs-impl)

In any reachable state, if $info\text{-}sent[g]_p = \langle x, X \rangle$ then $cur.id_p \geq g$.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix $g, p$. In the initial state we have that $info\text{-}sent[g]_p = \bot$ and
thus the invariant is vacuously true.

For the inductive step assume the invariant is true in $s$. We need to prove that it is true in $s'$ for
any possible step $(s, \pi, s')$. Fix $p, g, x, X$ and assume that $s'.info\text{-}sent[g]_p = \langle x, X \rangle$. We distinguish
two possible cases.

1. $s.info\text{-}sent[g]_p = \langle x, X \rangle$

By the inductive hypothesis we have that $s.cur.id_p \geq g$. By the monotonicity of $cur.id_p$ we have
that $s'.cur.id_p \geq s.cur.id_p$. Hence the invariant is true.

2. $s.info\text{-}sent[g]_p \neq \langle x, X \rangle$

Then it must be $\pi$ = vs-newview($v$)$_p$ and $g = v.id = s'.act.id_p$. Action vs-newview($v$)$_p$ sets $s'.cur_p$
to $v$, so $s'.cur.id_p = g$.

□

Invariant 5.2.4 (DVS-IMPL)
In any reachable state:

1. v_0 ∈ TotReg.

2. g_0 ≤ v.id for all v ∈ created.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Part 1 is true because in the initial state every processor
p ∈ P_0 has reg[g_0]_p = true. Part 2 is true because the only view in created is v_0.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s').

Consider Part 1 first. No view is ever removed from TotReg. Hence no step can make the assertion
false. Consider Part 2 now. Fix v and assume that v ∈ s'.created. We distinguish two cases.

1. v ∈ s.created.

Then the assertion follows from the inductive hypothesis.

2. v ∉ s.created.

It must be π = vs-createview(v)_p. By the precondition of this action we have that v.id > w.id
for all w ∈ s.created. By the inductive hypothesis g_0 ≤ w.id for all w ∈ s.created. Since
s'.created = s.created ∪ {v}, it follows that g_0 ≤ w.id for all w ∈ s'.created.

□

Invariant 5.2.5 (DVS-IMPL)

In any reachable state, if rcvd-rgst[q, v.id]_p ≠ ⊥ then cur_p ≠ ⊥.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, q and v. In the initial state we have that
rcvd-rgst[q, v.id]_p = ⊥. Hence the invariant is vacuously true.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p, q, v. We prove the invariant considering each possible action
π. Assume that s'.rcvd-rgst[q, v.id]_p ≠ ⊥.

1. π = vs-newview(v)_p.

Since s'.cur_p = v we have that s'.cur_p ≠ ⊥ (VS cannot deliver ⊥, since it is not a view).

2. π = vs-gprcv(⟨"registered"⟩)_{p,q}.

By the precondition of π (see VS) we have that s.current-viewid[p] ≠ ⊥. By Invariant 5.2.1 we
have s.cur.id_p = s.current-viewid[p] ≠ ⊥. Hence s'.cur.id_p = s.cur.id_p ≠ ⊥.

3. Other actions.

Variables rcvd-rgst[q, v.id]_p and cur_p are not modified. Hence the assertion cannot be made
false.

□

Invariant 5.2.6 (DVS-IMPL)

In any reachable state, if cur.id_p = ⊥ then act_p = v_0 and amb_p = ∅.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p. In the initial state we have that act_p = v_0 and
amb_p = ∅.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p. We prove the invariant considering each possible action π.
Assume that s'.cur_p = ⊥. Since no action sets cur_p to ⊥ it must be that s.cur_p = ⊥.

1. π = vs-gprcv(⟨"info", v, V⟩)_{p,q}.

This cannot happen. Indeed by the precondition of π (see VS) we have that s.current-viewid[p] ≠ ⊥.
By Invariant 5.2.1 we have s.cur.id_p = s.current-viewid[p]. Hence s'.cur.id_p = s.cur.id_p ≠ ⊥.
But we know that s'.cur.id_p = ⊥.

2. π = dvs-newview(v)_p.

This cannot happen. Indeed the precondition of π says that v = s.cur_p. Since s.cur_p = ⊥, we
have v = ⊥. Thus the precondition v.id > client-cur.id_p cannot be satisfied (⊥ cannot be
strictly greater than any view identifier).

3. π = dvs-garbage-collect(v)_p.

This cannot happen. Indeed the precondition of π requires rcvd-rgst[q, v.id]_p = true for all
q ∈ v.set, so by Invariant 5.2.5 we have that s.cur_p ≠ ⊥. But we know that s.cur_p = ⊥.

4. Other actions.

Variables cur_p, act_p and amb_p are not modified. Hence the assertion cannot be made false.

□

The following invariant states that if an "info" message is in transit for view v or has been
received by some process q in view v then there exists a process p that has sent the "info" in view
v and such that its current view is either v or a later one.

Invariant 5.2.7 (DVS-IMPL)

In any reachable state, let C be the following condition:

⟨"info", x, X⟩ ∈ msgs-to-vs[g]_p or ⟨"info", x, X⟩ ∈ pending[p, g] or ⟨⟨"info", x, X⟩, p⟩ ∈
queue[g] or info-rcvd[p, g]_q = ⟨x, X⟩.

If C is true then info-sent[g]_p = ⟨x, X⟩ and cur.id_p ≥ g.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, q, g, x and X. In the initial state msgs-to-vs[g]_p = λ,
pending[p, g] = λ, queue[g] = λ and info-rcvd[p, g]_q = ⊥. Hence, in the initial state, C is false and
the invariant is vacuously true.

For the inductive step assume that the invariant is true in a reachable state s. We need to prove
that it is true in state s' for any possible step (s, π, s') of the execution. Fix p, q, g, x, and X and
assume that C is true in s'.

1. π = vs-newview(v)_p.

By the code of π, s'.cur_p = v. Assume v.id ≠ g. Then the code of π shows that none of
msgs-to-vs[g]_p, pending[p, g], queue[g] or info-rcvd[p, g]_q is changed during this step. Thus C
is true also in s. By the inductive hypothesis we have s.info-sent[g]_p = ⟨x, X⟩ and s.cur.id_p ≥ g.
Since we are considering the case v.id ≠ g, we have that info-sent[g]_p is not changed by π.
Moreover the precondition of π (see VS) shows that s'.current-viewid[p] > s.current-viewid[p].
By Invariant 5.2.1, cur.id_p = current-viewid[p], so s'.cur.id_p > s.cur.id_p. This completes
showing the conclusion for the situation v.id ≠ g.

Assume now v.id = g. The code shows s'.cur.id_p = g as required. It remains to show that
s'.info-sent[g]_p = ⟨x, X⟩.

Action π does not alter the values of pending[p, g], queue[g] and info-rcvd[p, g]_q and
appends ⟨"info", s.act_p, s.amb_p⟩ to msgs-to-vs[g]_p. We claim that it must be x = s.act_p and
X = s.amb_p. Indeed if it is not so, then condition C is true also in state s (for the given
p, q, g, x, X) and by the inductive hypothesis we have s.cur.id_p ≥ g = v.id. By Invariant 5.2.1,
s.current-viewid[p] ≥ v.id. But this contradicts the precondition of π (see VS).
Thus x = s.act_p and X = s.amb_p. Then the code of π shows that s'.info-sent[g]_p = ⟨x, X⟩, as
required.

2. π = vs-gprcv(⟨"info", v, V⟩)_{p,q}.

If g ≠ cur.id_q then since C is true in s' it is true also in s (for the given p, q, g, x, X). Thus
the inductive hypothesis applies. Since the code does not change info-sent[g]_p and cur.id_p, the
invariant follows from the inductive hypothesis.

Hence assume that g = cur.id_q. First consider the case x = v and X = V. In this case, by
the precondition of π (see VS) we have that ⟨⟨"info", x, X⟩, p⟩ ∈ s.queue[g], so C is true in s.
Then the invariant follows from the inductive hypothesis.

Consider now the case x ≠ v or X ≠ V. In this case, by the code, we have that
s'.info-rcvd[p, g]_q ≠ ⟨x, X⟩. Since C is true in s', it must be that ⟨"info", x, X⟩ ∈ msgs-to-vs[g]_p
or ⟨"info", x, X⟩ ∈ pending[p, g] or ⟨⟨"info", x, X⟩, p⟩ ∈ queue[g] is true in s'. Variables
msgs-to-vs[g]_p, pending[p, g] and queue[g] are not changed by π. Hence C is true in s. The
invariant follows from the inductive hypothesis.

3. π = vs-gpsnd(⟨"info", v, V⟩)_p.

If g ≠ client-cur.id_p then since C is true in s' it is true also in s (for the given p, q, g, x, X).
Thus the inductive hypothesis applies. Since the code does not change info-sent[g]_p and cur.id_p,
the invariant follows from the inductive hypothesis.

Hence assume that g = client-cur.id_p. First consider the case x = v and X = V. In this case,
by the precondition of π (see DVS-IMPL) we have that ⟨"info", x, X⟩ ∈ s.msgs-to-vs[g]_p, so C is
true in s. Then the invariant follows from the inductive hypothesis.

Consider now the case x ≠ v or X ≠ V. Since C is true in s' we have that C is true in s too.
Indeed no ⟨"info", x, X⟩ message is deleted and info-rcvd[p, g]_q is not changed. The invariant
follows from the inductive hypothesis.

4. π = vs-order(⟨"info", v, V⟩, p, g).

First consider the case x = v and X = V. In this case, by the precondition of π we have that
⟨"info", x, X⟩ ∈ s.pending[p, g], so C is true in s. Then the invariant follows from the inductive
hypothesis.

Consider now the case x ≠ v or X ≠ V. Since C is true in s' we have that C is true in s too.
Indeed no ⟨"info", x, X⟩ message is deleted and info-rcvd[p, g]_q is not changed. The invariant
follows from the inductive hypothesis.

5. Other actions.

Condition C never changes from false to true and variables info-sent[g]_p and cur.id_p are not
modified. Hence the assertion cannot be made false.

□

The following invariant states that if a "registered" message for view v has been sent by process 
p then variable reg[v.id] p is set to true (that is, the view has been registered by the client at p). 

Invariant 5.2.8 (DVS-IMPL)

In any reachable state, let C be the following condition:

⟨"registered"⟩ ∈ msgs-to-vs[g]_p or ⟨"registered"⟩ ∈ pending[p, g] or ⟨⟨"registered"⟩, p⟩ ∈
queue[g] or rcvd-rgst[p, g]_q = true.

If C is true then reg[g]_p = true.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, g, q. In the initial state we have that msgs-to-vs[g]_p = λ,
pending[p, g] = λ, queue[g] = λ and rcvd-rgst[p, g]_q = false. Hence C is false in the initial state
and the invariant is vacuously true.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p, g, q and assume that C is true in s'.

1. π = dvs-register_p.

If s.client-cur.id_p ≠ g then C is true also in s and the invariant follows from the inductive
hypothesis. Hence assume s.client-cur.id_p = g. By the code of π we have that
s'.reg[g]_p = true.

2. π = vs-gpsnd(⟨"registered"⟩)_p.

If s.current-viewid[p] ≠ g then C is true also in s and the invariant follows from the inductive
hypothesis. Hence assume g = s.current-viewid[p]. By Invariant 5.2.1 we have that s.cur.id_p =
s.current-viewid[p]. Hence s.cur.id_p = g. By the precondition of π (see DVS-IMPL) we have
that ⟨"registered"⟩ ∈ s.msgs-to-vs[g]_p. Hence C is true in s and the invariant follows from the
inductive hypothesis.

3. π = vs-order(⟨"registered"⟩, p', g').

If p' ≠ p or g' ≠ g then C is true also in s and the invariant follows from the inductive
hypothesis. Hence assume p' = p and g' = g. By the precondition of π we have that
⟨"registered"⟩ ∈ s.pending[p, g]. Hence C is true also in s and the invariant follows from
the inductive hypothesis.

4. Other actions.

Condition C never changes from false to true and variable reg[g]_p is not modified. Hence the
assertion cannot be made false.

□
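
Condition C of Invariant 5.2.8 is a disjunction over the four places where a "registered" message from p for view identifier g can be found. A minimal sketch of how such a condition could be evaluated on a concrete state is given below. The dictionary-based state encoding, the key layout and the names `condition_c` and `invariant_5_2_8` are assumptions made for illustration; they are not part of DVS-IMPL.

```python
REGISTERED = ("registered",)   # stands for the message ⟨"registered"⟩

def condition_c(state, p, q, g):
    """Condition C of Invariant 5.2.8 for fixed p, q, g (hypothetical encoding)."""
    return (
        REGISTERED in state["msgs_to_vs"].get((g, p), [])
        or REGISTERED in state["pending"].get((p, g), [])
        or (REGISTERED, p) in state["queue"].get(g, [])
        or state["rcvd_rgst"].get((p, g, q), False)
    )

def invariant_5_2_8(state, p, q, g):
    """If C holds then reg[g]_p must be true."""
    return (not condition_c(state, p, q, g)) or state["reg"].get((g, p), False)

# A toy state: p has registered view identifier 1 and its message is in transit.
good = {
    "msgs_to_vs": {(1, "p"): [REGISTERED]},
    "pending": {}, "queue": {}, "rcvd_rgst": {},
    "reg": {(1, "p"): True},
}
# The same state with reg[1]_p unset violates the invariant.
bad = dict(good, reg={})
```

In the toy state `bad` the message is in transit although reg[1]_p is not set, which is exactly the situation the invariant rules out for reachable states.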

The following invariant states some facts about views in TotReg.

Invariant 5.2.9 (DVS-IMPL)
In any reachable state:

1. act_p ∈ TotReg.

2. If info-sent[g]_p = ⟨x, X⟩ then x ∈ TotReg.

3. use_p ∩ TotReg ≠ ∅.

Proof: First notice that Part 3 follows easily from Part 1 and the fact that, by definition,
act_p ∈ use_p. Hence we only need to prove Parts 1 and 2.

By induction on the length of the execution. The base case consists of proving that the invariant
is true in the initial state. For Part 1, fix p. In the initial state act_p = v_0 and v_0 is totally
registered by definition. For Part 2, fix p, g. In the initial state info-sent[g]_p = ⊥. Hence the
invariant is vacuously true.

For the inductive step assume the invariant is true in s. We need to prove that it is true in s' for
any possible step (s, π, s'). Fix p, g, x and X. We prove the invariant by considering each possible
action.

1. π = vs-newview(v)_p.

Part 1 is still true in s' because act_p is not modified (as well as TotReg).

Consider Part 2 now. Assume that s'.info-sent[g]_p = ⟨x, X⟩. If v.id ≠ g then
s.info-sent[g]_p = ⟨x, X⟩ and by the inductive hypothesis we have that x ∈ s.TotReg. Since no view
is ever removed from TotReg we have that x ∈ s'.TotReg, as needed. Hence we can further assume
that v.id = g. Since s'.info-sent[g]_p = ⟨x, X⟩ and action π sets info-sent[g]_p to ⟨act_p, amb_p⟩ it
must be that s.act_p = x and s.amb_p = X.

By the inductive hypothesis, Part 1, we have that s.act_p ∈ s.TotReg. But x = s.act_p and no
view is removed from TotReg. Hence x ∈ s'.TotReg. Thus Part 2 is still true in s'.

2. π = vs-gprcv(⟨"info", v, V⟩)_{p,q}.

Consider Part 1 first. If s'.act_p = s.act_p then Part 1 follows by the inductive hypothesis. Hence
assume that s'.act_p ≠ s.act_p. By the code we have that s'.act_p = v. Thus we have to prove
that v ∈ TotReg. By the precondition of π (in VS) we have ⟨⟨"info", v, V⟩, q⟩ ∈ s.queue[cur.id_p].
Then Invariant 5.2.7 implies that s.info-sent[cur.id_p]_q = ⟨v, V⟩. By the inductive hypothesis,
Part 2, we have that v ∈ s.TotReg, as needed.

Part 2 is preserved because info-sent[g]_p is not modified.

3. π = dvs-garbage-collect(v)_p.

Consider Part 1 first. If s'.act_p = s.act_p then Part 1 follows by the inductive hypothesis.
Hence assume that s'.act_p ≠ s.act_p. By the code we have that s'.act_p = v. Hence we have to
prove that v ∈ TotReg. By the precondition of π we have that rcvd-rgst[q, v.id]_p = true for all
q ∈ v.set. Then Invariant 5.2.8 implies that reg[v.id]_q = true for all q ∈ v.set, hence v ∈ TotReg.

Part 2 is preserved because info-sent[g]_p is not modified.

4. Other actions.

Variables act_p, info-sent[g]_p (as well as TotReg) are not modified. Hence the assertions cannot
be made false.

□

The following invariant states that if process q is in a view which has been attempted by process p
(which may or may not be q itself) then the current view of q is either v or a later one.

Invariant 5.2.10 (DVS-IMPL)

In any reachable state, if v ∈ attempted_p and q ∈ v.set then cur.id_q ≥ v.id.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, q, v and suppose that v ∈ attempted_p and q ∈ v.set. If
p ∉ P_0 then attempted_p = ∅, a contradiction. On the other hand, if p ∈ P_0 then since
v ∈ attempted_p, it must be that v = v_0. Moreover since q ∈ v.set we have that q ∈ P_0. Hence
cur_q = v_0, so cur.id_q ≥ v.id, as needed.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix p, q and v and assume that v ∈ s'.attempted_p and q ∈ v.set.
We distinguish two cases.

1. v ∈ s.attempted_p.

By the inductive hypothesis we have that s.cur.id_q ≥ v.id. By the monotonicity of cur.id_q we
have that s'.cur.id_q ≥ s.cur.id_q.

2. v ∉ s.attempted_p.

It must be π = dvs-newview(v)_p. We consider two possible cases: q = p and q ≠ p.

Assume that q = p. Then Invariant 5.2.2 implies that s'.client-cur.id_p ≥ v.id. Since
s'.cur.id_p = s'.client-cur.id_p, we have that s'.cur.id_p ≥ v.id, as needed.

Assume that q ≠ p. Then the precondition of π says that s.info-rcvd[q, v.id]_p ≠ ⊥. By Invariant
5.2.7 (used with p and q interchanged) we have that cur.id_q ≥ v.id, as needed.

□



The following invariant states properties of views in the use set. 

Invariant 5.2.11 (DVS-IMPL)
In any reachable state:

1. If cur_p ≠ ⊥ and w ∈ use_p, then w.id ≤ cur.id_p.

2. If cur_p ≠ ⊥ and client-cur_p ≠ cur_p and w ∈ use_p, then w.id < cur.id_p.

3. If info-sent[g]_p = ⟨x, X⟩ and w ∈ {x} ∪ X then w.id < g.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Consider Part 1 first. In the initial state we have that use_p is
either empty or contains only v_0. In the former case Part 1 is vacuously true. In the latter case we
have that w = v_0 and the invariant follows from the fact that g_0 is the minimum element of G. Parts
2 and 3 are vacuously true. Indeed in the initial state client-cur_p = cur_p and info-sent[g]_p = ⊥.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix p, g, x, X and w.

We prove that the invariant is still true in s' by considering each possible action π.

1. π = vs-newview(v)_p.

First consider Part 1. Assume that s'.cur_p ≠ ⊥ and w ∈ s'.use_p. Then w ∈ s.use_p. If
s.cur_p = ⊥, then, by Invariant 5.2.6, w = v_0. Since v_0.id is the minimum element of G,
we have that w.id < s'.cur.id_p. So assume that s.cur_p ≠ ⊥. In this case, by the inductive
hypothesis, Part 1, we have that w.id ≤ s.cur.id_p, which implies w.id < s'.cur.id_p.

Hence Part 1 is still true in s'. Since we actually proved that w.id < s'.cur.id_p, Part 2 is also
still true in s'.

Now consider Part 3. Assume that s'.info-sent[g]_p = ⟨x, X⟩ and w ∈ {x} ∪ X. If g ≠ v.id then
we have that s.info-sent[g]_p = ⟨x, X⟩. By the inductive hypothesis, Part 3, we have w.id < g,
as needed. Hence assume g = v.id. By the code of π, we have that s.use_p = {x} ∪ X. Now if
s.cur_p = ⊥, then by Invariant 5.2.6, w = v_0. Since v_0.id is the minimum element of G, we have
that w.id < v.id = g, as needed. So assume further that s.cur_p ≠ ⊥. In this case, the inductive
hypothesis, Part 1, implies that w.id ≤ s.cur.id_p, which implies w.id < s'.cur.id_p = v.id = g,
as needed.

2. π = dvs-newview(v)_p.

Consider Part 1 first. The only possible new element added to use_p is v. Since v.id = s'.cur.id_p,
Part 1 still holds in s'. Part 2 is vacuously true, because s'.client-cur_p = s'.cur_p. Part 3 is
preserved because info-sent[g]_p is not modified.

3. π = dvs-garbage-collect(v)_p.

Consider Part 1. Assume that s'.cur_p ≠ ⊥ and that w ∈ s'.use_p. By the code s'.cur_p = s.cur_p.
If w ∈ s.use_p then by the inductive hypothesis Part 1 is true in s and thus it is still true in s'.
Hence assume that w ∉ s.use_p. By the code, this cannot happen because no view is added to
use_p.

Part 2 can be proved in a similar way. Part 3 is preserved because info-sent[g]_p is not modified.

4. π = vs-gprcv(⟨"info", x, X⟩)_{q,p}.

The proof is exactly as in the previous case.

5. Other actions.

Variables use_p, cur_p, client-cur_p and info-sent[g]_p are not modified. Hence none of the
assertions can be made false.

□

The following three invariants say that certain views appear in use sets, or in "info" messages,
unless they have been garbage-collected.

Invariant 5.2.12 (DVS-IMPL)

In any reachable state, if w ∈ attempted_p then either w ∈ use_p or w.id < act.id_p.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. Fix p, w and suppose that w ∈ attempted_p. If p ∉ P_0 then
attempted_p = ∅, a contradiction. On the other hand, if p ∈ P_0 then since w ∈ attempted_p, it must
be that w = v_0. But in this case also act_p = v_0, so v_0 ∈ use_p, as needed.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). So fix w and p such that w ∈ s'.attempted_p. We distinguish two
possible cases.

1. w ∈ s.attempted_p.

By the inductive hypothesis we have that either w ∈ s.use_p or w.id < s.act.id_p. In the
latter case, because of the monotonicity of act.id_p, we have w.id < s'.act.id_p. So assume that
w ∈ s.use_p. If w ∈ s'.use_p we are done, so assume further that w ∉ s'.use_p. Then it must be
that either π = dvs-garbage-collect(v)_p or π = vs-gprcv(⟨"info", x, X⟩)_{r,p} for some r. In either
case, the code implies that s'.act.id_p > w.id.

2. w ∉ s.attempted_p.

It must be π = dvs-newview(v)_p with v = w. By the code, view v is inserted into attempted_p, but
also into amb_p (and hence into use_p). Thus the invariant is still true in s'.

□

Invariant 5.2.13 (DVS-IMPL)

In any reachable state, if info-rcvd[q, g]_p = ⟨x, X⟩ and w ∈ {x} ∪ X, then either w ∈ use_p or
w.id < act.id_p.

Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state info-rcvd[q, g]_p = ⊥ for any p, q, g. Hence
the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix p, q, g, x, X and w, and assume that s'.info-rcvd[q, g]_p = ⟨x, X⟩,
and w ∈ {x} ∪ X. We consider two cases:

1. s.info-rcvd[q, g]_p = ⟨x, X⟩.

By the statement applied to s, we obtain that either w ∈ s.use_p, or s.act.id_p > w.id. In the
latter case, s'.act.id_p > w.id, because of the monotonicity of act.id_p. So assume that w ∈ s.use_p.
If w ∈ s'.use_p then we are done, so assume further that w ∉ s'.use_p. (That is, w is
garbage-collected.)

Then it must be that either π = dvs-garbage-collect(v)_p or π = vs-gprcv(⟨"info", x, X⟩)_{r,p} for
some r. In either case, the code implies that s'.act.id_p > w.id.

2. s.info-rcvd[q, g]_p ≠ ⟨x, X⟩.

Then π = vs-gprcv(⟨"info", x, X⟩)_{q,p}. If w ∈ s'.use_p then we are done. Hence assume that
w ∉ s'.use_p. By the code, we have that s'.act.id_p > w.id (that is, w is garbage-collected).

□

Invariant 5.2.14 (DVS-IMPL)

In any reachable state, if info-sent[g]_p = ⟨x, X⟩, w ∈ attempted_p, and w.id < g, then either
w ∈ {x} ∪ X or w.id < x.id.

Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, info-sent[g]_p = ⊥ for all g, p, so the
statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix p, g, w, x, and X, and assume that s'.info-sent[g]_p = ⟨x, X⟩,
w ∈ s'.attempted_p, and w.id < g. We consider four cases:

1. s.info-sent[g]_p = ⟨x, X⟩ and w ∈ s.attempted_p.

Then the statement for s implies that either w ∈ {x} ∪ X or w.id < x.id. In either case the
statement is true in s' also.

2. s.info-sent[g]_p ≠ ⟨x, X⟩ and w ∉ s.attempted_p.

This cannot happen because both conditions cannot become true in a single step: the first only
becomes true by means of a vs-newview(v)_p, for some view v, while the second only becomes
true by means of dvs-newview(w)_p.

3. s.info-sent[g]_p ≠ ⟨x, X⟩ and w ∈ s.attempted_p.

It must be π = vs-newview(v)_p, for some v, x must be s.act_p, and X must be s.amb_p.
Invariant 5.2.12 implies that either w ∈ s.use_p or w.id < s.act.id_p. Now, s.use_p = {s.act_p} ∪
s.amb_p = {x} ∪ X. So we have that either w ∈ {x} ∪ X or w.id < x.id, as needed.

4. s.info-sent[g]_p = ⟨x, X⟩ and w ∉ s.attempted_p.

Then π must be dvs-newview(w)_p. We claim that this cannot happen: Since s.info-sent[g]_p =
⟨x, X⟩, by Invariant 5.2.3 we have s.cur.id_p ≥ g. Since g > w.id, we have s.cur.id_p > w.id. But
the precondition of π requires that s.cur.id_p = w.id. Hence π is not enabled in state s.

□

Invariant 5.2.15 says that two attempted views having no intervening totally registered view, 
and having a common member, q, that has attempted the first view, must intersect in a majority 
of processors. This is because, under these circumstances, information must flow from q to any 
processor that attempts the second view. 

Invariant 5.2.15 (DVS-IMPL)

In any reachable state, suppose that v ∈ attempted_p, q ∈ v.set, w ∈ attempted_q, w.id < v.id, and
there is no x ∈ TotReg such that w.id < x.id < v.id. Then |v.set ∩ w.set| > |w.set|/2.

Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only v_0 is attempted, so the hypotheses
cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in
s' for any possible step (s, π, s'). Fix v, w, p, and q, and assume that v ∈ s'.attempted_p, q ∈ v.set,
w ∈ s'.attempted_q, w.id < v.id, and there is no x ∈ s'.TotReg such that w.id < x.id < v.id. Then
also there is no x ∈ s.TotReg such that w.id < x.id < v.id. We consider four cases:

1. v ∈ s.attempted_p and w ∈ s.attempted_q.

Then the statement for s implies that |v.set ∩ w.set| > |w.set|/2, as needed.

2. v ∉ s.attempted_p and w ∉ s.attempted_q.

This cannot happen because we cannot have both v and w becoming attempted in a single
step.

3. v ∉ s.attempted_p and w ∈ s.attempted_q.

Then π must be dvs-newview(v)_p. Since q ∈ v.set, by the precondition of π we have that
s.info-rcvd[q, v.id]_p = ⟨x, X⟩ for some x and X. Then Invariant 5.2.7 implies that
s.info-sent[v.id]_q = ⟨x, X⟩. Then (since w.id < v.id), Invariant 5.2.14 implies that either
w ∈ {x} ∪ X or w.id < x.id. If w.id < x.id, then we obtain a contradiction. Indeed by Invariant
5.2.9 x ∈ s.TotReg and by Invariant 5.2.11, Part 3 (used with w = x) we have x.id < v.id. This
contradicts the hypothesis. So w ∈ {x} ∪ X.

Now by Invariant 5.2.13 we have that either w ∈ s.use_p or w.id < s.act.id_p. In the former
case, by the precondition of π, we have |v.set ∩ w.set| > |w.set|/2. In the latter case, we obtain
a contradiction. Indeed by Invariant 5.2.9 we have s.act_p ∈ TotReg. Moreover by the
precondition of π, s.cur_p cannot be ⊥ and s.cur.id_p > s.client-cur.id_p and, by definition,
s.act_p ∈ s.use_p. Hence by Invariant 5.2.11, Part 2, we have s.act.id_p < s.cur.id_p = v.id. Thus
we would have a totally registered view act_p such that w.id < act.id_p < v.id. This contradicts
the hypothesis.

4. v ∈ s.attempted_p and w ∉ s.attempted_q.

Then π must be dvs-newview(w)_q. But this cannot happen. Indeed since v ∈ s.attempted_p
and q ∈ v.set, Invariant 5.2.10 implies that s.cur.id_q ≥ v.id. Since v.id > w.id, we have
s.cur.id_q > w.id. But the precondition of action π requires s.cur.id_q = w.id, so π is not
enabled in s.

□
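
Case 3 of the proof above relies on the precondition of dvs-newview(v)_p, which, as used in these proofs, requires the new view to contain a strict majority of the members of every view in use_p. A minimal sketch of that check follows. Representing a view only by its membership set, and the names `intersects_in_majority` and `admissible`, are assumptions made for illustration and are not taken from the DVS-IMPL code.

```python
def intersects_in_majority(v_set, w_set):
    """True if v_set contains a strict majority of the members of w_set,
    i.e. |v.set ∩ w.set| > |w.set| / 2 (integer arithmetic avoids rounding)."""
    return 2 * len(v_set & w_set) > len(w_set)

def admissible(v_set, use_sets):
    """Check the majority condition against every membership set in a use set."""
    return all(intersects_in_majority(v_set, w) for w in use_sets)

use_sets = [frozenset("abc"), frozenset("abcd")]
print(admissible(frozenset("abd"), use_sets))   # shares {a,b} of 3 and {a,b,d} of 4
print(admissible(frozenset("ad"), use_sets))    # shares only {a} with the first set
```

The first candidate passes both checks; the second fails because it shares a single member with a three-member view.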

Invariant 5.2.16 says that any attempted view v intersects the latest preceding totally registered
view w in a majority of members of w.

Invariant 5.2.16 (DVS-IMPL)

In any reachable state, suppose that v ∈ Att, w ∈ TotReg, w.id < v.id, and there is no x ∈ TotReg
such that w.id < x.id < v.id. Then |v.set ∩ w.set| > |w.set|/2.

Proof: By induction on the length of an execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, only v_0 is attempted, so the hypotheses
cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state s. We need to prove that it is true in s'
for any possible step (s, π, s'). Fix v and w, and assume that v ∈ s'.Att, w ∈ s'.TotReg, w.id < v.id,
and there is no x ∈ s'.TotReg such that w.id < x.id < v.id. We consider four cases:

1. v ∈ s.Att and w ∈ s.TotReg.

Then, from the inductive hypothesis we have |v.set ∩ w.set| > |w.set|/2.

2. v ∉ s.Att and w ∉ s.TotReg.

This cannot happen because we cannot have both v becoming attempted and w becoming
totally registered in a single step.

3. v ∉ s.Att and w ∈ s.TotReg.

Then π must be dvs-newview(v)_p for some p. The precondition of π implies that, for any view
y ∈ s.use_p, |v.set ∩ y.set| > |y.set|/2. Hence to prove the claim it is enough to prove that
w ∈ s.use_p. We proceed by contradiction assuming that w ∉ s.use_p.

By Invariant 5.2.9, Part 3, s.use_p ∩ s.TotReg ≠ ∅. Let m be the view in s.use_p ∩ s.TotReg
having the biggest identifier. We know that m ≠ w because w ∉ s.use_p. Also, m ≠ v, because
m ∈ s.TotReg and v ∉ s.TotReg. It follows that m.id ≠ w.id and m.id ≠ v.id.

We claim that m.id < w.id. We have already shown that m.id ≠ w.id. Suppose for the sake
of contradiction that m.id > w.id. From the precondition of action π we have that s.cur_p = v
and hence s.cur_p ≠ ⊥. Also from the precondition of π we have that s.client-cur.id_p < s.cur.id_p.
Since m ∈ s.use_p, Invariant 5.2.11, Part 2, implies that m.id < s.cur.id_p and since s.cur_p = v
we have m.id < v.id. So w.id < m.id < v.id. Since m ∈ s'.TotReg, this contradicts
the hypothesis of the inductive step. Therefore, m.id < w.id.

Let n be the view in s.TotReg that has the smallest id strictly greater than that of m. Remember
that w ∈ s'.TotReg and since π = dvs-newview(v)_p we have that w ∈ s.TotReg; thus n exists and it
holds that m.id < n.id ≤ w.id < v.id. Since m ∈ s.use_p, the precondition of π implies that
|v.set ∩ m.set| > |m.set|/2. By the statement applied to state s, |n.set ∩ m.set| > |m.set|/2. Hence
there exists a processor q ∈ v.set ∩ n.set. By the precondition of π, s.info-rcvd[q, v.id]_p = ⟨x, X⟩
for some x, X. Then Invariant 5.2.7 implies that s.info-sent[v.id]_q = ⟨x, X⟩. Then Invariant
5.2.11, Part 3 (used with w = x), implies that x.id < v.id. Since n ∈ s.TotReg, we have that
n ∈ s.attempted_q. Then Invariant 5.2.14 (used with w = n) implies that either n ∈ {x} ∪ X
or n.id < x.id. In either case, {x} ∪ X contains a view y ∈ s.TotReg (either n or x) such that
n.id ≤ y.id < v.id. Then Invariant 5.2.13 implies that either y ∈ s.use_p or y.id < s.act.id_p. By
Invariant 5.2.9, Part 1, s.act_p ∈ s.TotReg and by definition, s.act_p ∈ s.use_p. So in either case,
the hypothesis that m is the totally registered view with the largest id belonging to s.use_p is
contradicted.

4. v ∈ s.Att and w ∉ s.TotReg.

Then π must be dvs-register_p for some p. Let m be the view in s.TotReg with the largest id
that is strictly less than w.id. By the statement for s, we know that |w.set ∩ m.set| > |m.set|/2
and |v.set ∩ m.set| > |m.set|/2. Hence there is a processor q ∈ w.set ∩ v.set.

Since v ∈ s.Att, there exists a processor r such that v ∈ s.attempted_r. Thus also
v ∈ s'.attempted_r. Since w ∈ s'.TotReg, we have that w ∈ s'.attempted_q. By assumption, there
is no view x ∈ s'.TotReg such that w.id < x.id < v.id. By Invariant 5.2.15 applied to state s'
(with p = r), we have that |v.set ∩ w.set| > |w.set|/2, as needed.

□

The final invariant, a corollary to Invariant 5.2.16, is instrumental in proving that DVS-IMPL
implements DVS.

Invariant 5.2.17 (DVS-IMPL)

In any reachable state, if $v, w \in Att$, $w.id < v.id$, and there is no $x \in TotReg$ with $w.id < x.id < v.id$,
then $v.set \cap w.set \neq \emptyset$.

Proof: Suppose that $v$ and $w$ are as given. We consider two cases.

1. $w \in TotReg$.

Since there is no $x \in TotReg$ with $w.id < x.id < v.id$, Invariant 5.2.16 implies that
$|v.set \cap w.set| > |w.set|/2$, which implies that $v.set \cap w.set \neq \{\}$, as needed.

2. $w \notin TotReg$.

Then let $Y = \{y \mid y \in TotReg,\ y.id < w.id\}$. We first show that $Y$ is nonempty: Invariant 5.2.4
implies that $v_0 \in TotReg$ and that $v_0.id \leq w.id$. If $v_0.id = w.id$, then by Invariant 4.1.1, we
have $w = v_0$. But then $w \in TotReg$, a contradiction to the definition of this case. So we must
have $v_0.id < w.id$, which implies that $v_0 \in Y$, so $Y$ is nonempty.

Now fix $z$ to be the view in $Y$ with the largest id. We have that there is no $x \in TotReg$
with $z.id < x.id < v.id$. Then Invariant 5.2.16 implies that $|w.set \cap z.set| > |z.set|/2$ and
$|v.set \cap z.set| > |z.set|/2$. Together, these two facts imply that $v.set \cap w.set \neq \{\}$, as needed.

□
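
The last step of case 2 is a counting argument. Written out, with the shorthand $A = w.set \cap z.set$ and $B = v.set \cap z.set$ (names introduced here only for illustration), and using $A \cup B \subseteq z.set$:

```latex
\begin{align*}
|A \cap B| &= |A| + |B| - |A \cup B| \\
           &> \frac{|z.set|}{2} + \frac{|z.set|}{2} - |z.set| \\
           &= 0 .
\end{align*}
```

Hence $A \cap B$ is nonempty, and since $A \cap B \subseteq v.set \cap w.set$, the two membership sets intersect.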

5.2.3 Proof that DVS-IMPL implements DVS

We prove that DVS-IMPL implements DVS by defining a function $\mathcal{F}_{dvs}$ that maps states of DVS-IMPL
to states of DVS and proving that this function is an abstraction function. We first give the
definition of $\mathcal{F}_{dvs}$ together with some auxiliary lemmas, and then prove
that $\mathcal{F}_{dvs}$ is an abstraction function.

The abstraction function for DVS-IMPL.

DVS-IMPL uses VS to send client messages and messages generated by the implementation ("info"
and "registered" messages). The abstraction function discards the non-client messages. Thus, if $q$
is a finite sequence of client and non-client messages, we define $purge(q)$ to be the queue obtained
by deleting any "info" or "registered" messages from $q$, and $purgesize(q)$ to be the number of "info"
and "registered" messages in $q$. Figure 5-4 defines the abstraction function $\mathcal{F}_{dvs}$.

Next we give some simple consequences of the definition of $\mathcal{F}_{dvs}$. They deal with the messages
delivered by DVS-IMPL. They state that these messages are exactly the ones that DVS would deliver
to the client.
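
As a minimal sketch of the two helper functions, assume a hypothetical encoding in which client messages are plain strings and implementation messages are tuples whose first component is the tag "info" or "registered". The encoding and the helper name `is_client` are assumptions made for this example only.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: client messages are strings; implementation
# messages are tuples tagged "info" or "registered".
Message = Union[str, Tuple]


def is_client(msg: Message) -> bool:
    """True unless msg is an "info" or "registered" implementation message."""
    return not (isinstance(msg, tuple) and len(msg) > 0
                and msg[0] in ("info", "registered"))


def purge(q: List[Message]) -> List[Message]:
    """The queue obtained by deleting implementation messages from q."""
    return [m for m in q if is_client(m)]


def purgesize(q: List[Message]) -> int:
    """The number of implementation messages in q."""
    return sum(1 for m in q if not is_client(m))


example = ["a", ("info", "act", "amb"), "b", ("registered",)]
```

By construction, `len(purge(q)) + purgesize(q) == len(q)` for every queue `q`, which is the bookkeeping fact the pointer arithmetic in the abstraction function relies on.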
Let $s$ be a state of DVS-IMPL. The state $t = \mathcal{F}_{dvs}(s)$ of DVS is the following.

• $t.created = \bigcup_{p \in \mathcal{P}} s.attempted_p$

• for each $p \in \mathcal{P}$, $t.\textit{current-viewid}[p] = s.\textit{client-cur}.id_p$

• for each $g \in \mathcal{G}$, $t.attempted[g] = \{p \mid g = v.id,\ v \in s.attempted_p\}$

• for each $g \in \mathcal{G}$, $t.registered[g] = \{p \mid s.reg[g]_p\}$

• for each $p \in \mathcal{P}$, $g \in \mathcal{G}$, $t.pending[p, g] = purge(s.pending[p, g]) \circ purge(s.\textit{msgs-to-vs}[g]_p)$

• for each $g \in \mathcal{G}$, $t.queue[g] = purge(s.queue[g])$

• for each $p \in \mathcal{P}$, $g \in \mathcal{G}$,
  $t.next[p, g] = s.next[p, g] - purgesize(s.queue[g](1..next[p, g] - 1)) - |s.\textit{msgs-from-vs}[g]_p|$

• for each $p \in \mathcal{P}$, $g \in \mathcal{G}$,
  $t.\textit{next-safe}[p, g] = s.\textit{next-safe}[p, g] - purgesize(s.queue[g](1..\textit{next-safe}[p, g] - 1)) - |s.\textit{safe-from-vs}[g]_p|$

Figure 5-4: The abstraction function $\mathcal{F}_{dvs}$.
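
The pointer clause of the figure can be sketched in a few lines. The sketch below is illustrative only: it reuses the hypothetical tagged-tuple encoding of implementation messages, and the function name `abstract_next` is an assumption, not a name from the thesis. It shows that the abstract pointer stays put while VS delivers a message into the implementation's buffer, and advances only when the client actually receives it.

```python
def is_impl_message(msg):
    """Hypothetical encoding: implementation messages are tagged tuples."""
    return isinstance(msg, tuple) and len(msg) > 0 and msg[0] in ("info", "registered")


def purgesize(q):
    return sum(1 for m in q if is_impl_message(m))


def abstract_next(s_next, queue, msgs_from_vs):
    """t.next = s.next - purgesize(queue(1..s.next-1)) - |msgs-from-vs|."""
    prefix = queue[: s_next - 1]  # positions 1..s.next-1 (1-indexed in the text)
    return s_next - purgesize(prefix) - len(msgs_from_vs)


queue = ["a", ("info", "act", "amb"), "b"]

start = abstract_next(1, queue, [])               # nothing delivered yet
after_vs_rcv_a = abstract_next(2, queue, ["a"])   # VS delivered "a"; client has not
after_dvs_rcv_a = abstract_next(2, queue, [])     # client received "a"
after_vs_rcv_info = abstract_next(3, queue, [])   # VS delivered the "info" message
```

In the run above the first and third terms cancel when a client message is buffered, and the first and second terms cancel when an "info" message is consumed, so only the client-visible delivery moves the abstract pointer.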

Invariant 5.2.18 (DVS-IMPL)

In any reachable state $s$, if $s.\textit{msgs-from-vs}[g]_p = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$, then we have that

$\mathcal{F}_{dvs}(s).queue[g](next[p, g]..next[p, g] + k - 1) = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state no message is in $\textit{msgs-from-vs}[g]_p$. Hence the
invariant is vacuously true.

For the inductive step, assume that the invariant is true in state $s$. We need to prove that it is
true in state $s'$ for any possible step $(s, \pi, s')$. Fix $p$, $g$ and $m_1, q_1, m_2, q_2, \ldots, m_k, q_k$ and assume that
$s'.\textit{msgs-from-vs}[g]_p = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$. We distinguish the following cases.

1. $s.\textit{msgs-from-vs}[g]_p = ((m_1, q_1), \ldots, (m_{k-1}, q_{k-1}))$.

It must be $\pi =$ vs-gprcv$(m_k)_{q_k, p}$. By the inductive hypothesis we have that
$\mathcal{F}_{dvs}(s).queue[g](next[p, g]..next[p, g] + k - 2) = ((m_1, q_1), \ldots, (m_{k-1}, q_{k-1}))$.

By the code in VS we have that $next[p, g]$ is increased by one and by the code in DVS we have that
the size of $\textit{msgs-from-vs}[g]_p$ also increases by one. Hence by the definition of $\mathcal{F}_{dvs}$, we have
that $\mathcal{F}_{dvs}(s').next[p, g] = \mathcal{F}_{dvs}(s).next[p, g]$. Moreover $\mathcal{F}_{dvs}(s').queue[g] = \mathcal{F}_{dvs}(s).queue[g]$
and by the precondition of $\pi$ we have that $\mathcal{F}_{dvs}(s).queue[g](s.next[p, g] + k - 1) = (m_k, q_k)$.
Thus the invariant is still true in $s'$.

2. $s.\textit{msgs-from-vs}[g]_p = ((m, q), (m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.

Then $\pi =$ dvs-gprcv$(m)_{q, p}$. By the inductive hypothesis we have that
$\mathcal{F}_{dvs}(s).queue[g](next[p, g]..next[p, g] + k) = ((m, q), (m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.
By the code we have that $next[p, g]$ is incremented by one. Since $\mathcal{F}_{dvs}(s').queue[g] = \mathcal{F}_{dvs}(s).queue[g]$,
the invariant is still true in $s'$.

3. $s.\textit{msgs-from-vs}[g]_p = s'.\textit{msgs-from-vs}[g]_p$.

By the inductive hypothesis the assertion is true in state $s$. For any possible action in this
case $\mathcal{F}_{dvs}(s').next[p, g] = \mathcal{F}_{dvs}(s).next[p, g]$ and the portion of $\mathcal{F}_{dvs}(s).queue[g]$ involved in the
statement of the invariant never changes because messages are only appended to $queue[g]$.
Thus the assertion cannot be made false.

4. Other cases.

Not possible. Indeed $\textit{msgs-from-vs}[g]_p$ either stays the same or is changed by appending a
message or deleting the head.

□

The following invariant follows easily from the previous one. It states that the next message
delivered by DVS-IMPL to a processor $p$ is the same one that DVS delivers.

Invariant 5.2.19 (DVS-IMPL)

In any reachable state $s$, if $(m, q)$ is head of $s.\textit{msgs-from-vs}[g]_p$, then $\mathcal{F}_{dvs}(s).queue[g](next[p, g]) = (m, q)$.

Proof: Follows easily from Invariant 5.2.18. □

Similar invariants hold for the delivery of safe messages.

Invariant 5.2.20 (DVS-IMPL)

In any reachable state $s$, we have that if $s.\textit{safe-from-vs}[g]_p = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$, then

$\mathcal{F}_{dvs}(s).queue[g](\textit{next-safe}[p, g]..\textit{next-safe}[p, g] + k - 1) = ((m_1, q_1), (m_2, q_2), \ldots, (m_k, q_k))$.

Proof: The proof is as for Invariant 5.2.18, except that it uses the $\textit{safe-from-vs}$ queue instead of $\textit{msgs-from-vs}$
and the pointer $\textit{next-safe}$ instead of $next$. □

Invariant 5.2.21 (DVS-IMPL)

In any reachable state $s$, if $(m, q)$ is head of $s.\textit{safe-from-vs}[g]_p$, then $\mathcal{F}_{dvs}(s).queue[g](\textit{next-safe}[p, g]) = (m, q)$.

Proof: Follows easily from Invariant 5.2.20. □

Notice that $v$ is totally registered in state $s$ of DVS-IMPL if and only if it is totally registered in the
state $\mathcal{F}_{dvs}(s)$ of DVS.
Proof that $\mathcal{F}_{dvs}$ is an abstraction function.

In order to prove that $\mathcal{F}_{dvs}$ is an abstraction function we need to prove that for any initial state
$s$ of DVS-IMPL we have that $\mathcal{F}_{dvs}(s)$ is an initial state of DVS and that for any possible step $\pi$ of
DVS-IMPL there exists a sequence $\alpha$ of steps of DVS such that the trace of $\alpha$, that is the externally
observable behavior, is equal to the trace of $\pi$. Lemmas 5.2.22 and 5.2.23 prove the above.

Lemma 5.2.22 If $s$ is an initial state of DVS-IMPL then $\mathcal{F}_{dvs}(s)$ is an initial state of DVS.

Proof: Let $s_0$ be the unique initial state of DVS-IMPL and $t_0$ the unique initial state of DVS.

We have $s_0.attempted_p = \{v_0\}$ for $p \in P_0$ and $s_0.attempted_p = \{\}$ for $p \notin P_0$. By the definition of
$\mathcal{F}_{dvs}$ and the fact that $P_0 \neq \{\}$ (because all membership sets are defined to be nonempty), we have
$\mathcal{F}_{dvs}(s_0).created = \{v_0\}$. This is as in $t_0$.

We have $s_0.\textit{client-cur}_p = v_0$ for $p \in P_0$ and $s_0.\textit{client-cur}_p = \bot$ for $p \notin P_0$. By the definition
of $\mathcal{F}_{dvs}$ we have $\mathcal{F}_{dvs}(s_0).\textit{current-viewid}[p] = g_0$ for $p \in P_0$ and $\mathcal{F}_{dvs}(s_0).\textit{current-viewid}[p] = \bot$ for
$p \notin P_0$. This is as in $t_0$.

We have $s_0.attempted_p = \{v_0\}$ for $p \in P_0$ and $s_0.attempted_p = \{\}$ for $p \notin P_0$. By the definition of
$\mathcal{F}_{dvs}$ we have $\mathcal{F}_{dvs}(s_0).attempted[g_0] = P_0$ and $\mathcal{F}_{dvs}(s_0).attempted[g] = \{\}$ for $g \neq g_0$. This is as in $t_0$.

Let $g \in \mathcal{G}$. We have that $s_0.reg[g]_p$ is true if and only if $p \in P_0$ and $g = g_0$. By the definition of
$\mathcal{F}_{dvs}$ we have $\mathcal{F}_{dvs}(s_0).registered[g_0] = P_0$ and $\mathcal{F}_{dvs}(s_0).registered[g] = \{\}$ for $g \neq g_0$, as in $t_0$.

Let $p \in \mathcal{P}$. We have that $s_0.\textit{msgs-to-vs}[g]_p = \lambda$ and $s_0.pending[p, g] = \lambda$. By the definition of
$\mathcal{F}_{dvs}$ we have $\mathcal{F}_{dvs}(s_0).pending[p, g] = \lambda$, as in $t_0$.

Let $g \in \mathcal{G}$. We have $s_0.queue[g] = \lambda$. By the definition of $\mathcal{F}_{dvs}$ we have $\mathcal{F}_{dvs}(s_0).queue[g] = \lambda$,
as in $t_0$.

Let $p \in \mathcal{P}$, $g \in \mathcal{G}$. We have $s_0.next[p, g] = 1$, $purgesize(s_0.VS.queue[g]) = 0$ and $s_0.\textit{msgs-from-vs}[g]_p = \lambda$.
By the definition of $\mathcal{F}_{dvs}$ we have $\mathcal{F}_{dvs}(s_0).next[p, g] = 1$, as in $t_0$. A similar argument holds for
$\textit{next-safe}$.

Thus $\mathcal{F}_{dvs}(s_0) = t_0$, as needed. □

Lemma 5.2.23 Let $s$ be a reachable state of DVS-IMPL, $\mathcal{F}_{dvs}(s)$ a reachable state of DVS-SYS, and
$(s, \pi, s')$ a step of DVS-IMPL. Then there is an execution fragment $\alpha$ of DVS-SYS that goes from
$\mathcal{F}_{dvs}(s)$ to $\mathcal{F}_{dvs}(s')$, such that $trace(\alpha) = trace(\pi)$.

Proof: By case analysis based on the type of the action $\pi$. (The only interesting case is where $\pi =$
dvs-newview$(v)_p$.) Define $t = \mathcal{F}_{dvs}(s)$ and $t' = \mathcal{F}_{dvs}(s')$.

1. $\pi =$ vs-createview$(v)$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ modifies $created$. The definition of $\mathcal{F}_{dvs}$ is not sensitive
to this change. Therefore, $t = t'$, and we set $\alpha = t$.
2. $\pi =$ vs-newview$(v)_p$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ modifies $cur_p$, $\textit{info-sent}[cur.id]_p$, and $\textit{current-viewid}[p]$,
and adds an "info" message to $\textit{msgs-to-vs}[cur.id]_p$. The definition of $\mathcal{F}_{dvs}$ is not sensitive to
any of these changes. Therefore, $t = t'$, and we set $\alpha = t$.

3. $\pi =$ vs-gpsnd$(m)_p$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ just moves a message from the queue $\textit{msgs-to-vs}[cur.id]_p$
to the queue $pending[p, \textit{current-viewid}[p]]$. The definition of $\mathcal{F}_{dvs}$ is not sensitive to this change.
Therefore, $t = t'$, and we set $\alpha = t$.

4. $\pi =$ vs-order$(m, p, g)$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ moves a message from $pending[p, g]$ to $queue[g]$. We
consider two cases.

(a) $m \in M_c$

Then we set $\alpha = (t, \text{dvs-order}(m, p, g), t')$. We claim that dvs-order$(m, p, g)$ is enabled in $t$:
Since vs-order$(m, p, g)$ is enabled in $s$, it follows that $m$ is the head of $s.pending[p, g]$. By
the definition of $\mathcal{F}_{dvs}$, $m$ is also the head of $t.pending[p, g]$. It follows that dvs-order$(m, p, g)$
is enabled in $t$.

By definition of $\mathcal{F}_{dvs}$, $t'$ differs from $t$ only in the fact that $m$ is moved from $pending[p, g]$
to $queue[g]$. This is the effect achieved by applying dvs-order$(m, p, g)$ to $t$.

(b) $m \notin M_c$

Then the definition of $\mathcal{F}_{dvs}$ is not sensitive to this change. Therefore, $t = t'$, and we set
$\alpha = t$.

5. $\pi =$ vs-gprcv$(\langle$"info"$, v, s\rangle)_{q, p}$

Then $trace((s, \pi, s')) = \lambda$. This action can modify $\textit{info-rcvd}[cur.id_p, q]_p$, $act_p$ and $amb_p$ (see
code of DVS) and causes $next[p, cur.id_p]$ to be incremented (see code of VS). The definition
of $\mathcal{F}_{dvs}$ is not sensitive to these changes. (The only interesting case is the definition of
$t.next[p, cur.id_p]$, where the absolute values of the first two terms on the right-hand side are
both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.

6. $\pi =$ vs-gprcv("registered")$_p$

Then $trace((s, \pi, s')) = \lambda$. This action can modify $\textit{rcvd-rgst}[cur.id, q]_p$. It also causes the
pointer $next[p, cur.id_p]$ to be incremented. The definition of $\mathcal{F}_{dvs}$ is not sensitive to these
changes. (The only interesting case is the definition of $t.next[p, cur.id_p]$, where the absolute
values of the first two terms on the right-hand side are both increased by 1, but they cancel
each other out.) Therefore, $t = t'$, and we set $\alpha = t$.

7. $\pi =$ vs-gprcv$(m)_p$, $m \in M_c$

Then $trace((s, \pi, s')) = \lambda$. This action copies a message from the sequence $queue[cur.id]_p$ to
the sequence $\textit{msgs-from-vs}[p, \textit{client-cur}[p]]$, and causes $next[p, cur.id_p]$ to be incremented. The
definition of $\mathcal{F}_{dvs}$ is not sensitive to these changes. (The only interesting case is the definition
of $t.next[p, cur.id_p]$, where the absolute values of the first and third terms on the right-hand
side are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set
$\alpha = t$.

8. $\pi =$ vs-safe$(\langle m, v, s\rangle)_{q, p}$, $m \in \{$"info", "registered"$\}$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ just causes $\textit{next-safe}[p, cur.id_p]$ to be incremented. The
definition of $\mathcal{F}_{dvs}$ is not sensitive to this change. (The only interesting case is the definition of
$t.\textit{next-safe}[p, cur.id_p]$, where the absolute values of the first two terms on the right-hand side
are both increased by 1, but they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.

9. $\pi =$ vs-safe$(m)_p$, $m \in M_c$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ adds a message to $\textit{safe-from-vs}[cur.id]_p$ and causes the
pointer $\textit{next-safe}[p, cur.id_p]$ to be incremented. The definition of $\mathcal{F}_{dvs}$ is not sensitive to
these changes. (The only interesting case is the definition of $t.\textit{next-safe}[p, cur.id_p]$, where the
absolute values of the first and third terms on the right-hand side are both increased by 1, but
they cancel each other out.) Therefore, $t = t'$, and we set $\alpha = t$.

10. $\pi =$ dvs-newview$(v)_p$

Then $trace((s, \pi, s')) = \pi$. In DVS-IMPL, this action modifies only variables $amb_p$, $attempted_p$,
$\textit{client-cur}_p$. We have $s'.\textit{client-cur}_p = v$ and $s'.attempted_p = s.attempted_p \cup \{v\}$. By definition
of $\mathcal{F}_{dvs}$, we have that $t'.\textit{current-viewid}[p] = s'.\textit{client-cur}.id_p = v.id$, $t'.created = t.created \cup \{v\}$
and $t'.attempted[v.id] = t.attempted[v.id] \cup \{p\}$, while all other state variables in $t'$ are as in $t$.

We consider two cases:

(a) $v \in t.created$.

In this case, we set $\alpha = (t, \pi', t')$, where $\pi' =$ dvs-newview$(v)_p$. The code shows that $\pi'$
brings DVS-SYS from state $t$ to state $t'$. It remains to prove that $\pi'$ is enabled in state $t$,
that is, that $v \in t.created$ and $v.id > t.\textit{current-viewid}[p]$. The first of these two conditions
is true because of the defining condition for this case. The second condition follows from
the precondition of $\pi$ in DVS-IMPL: this precondition implies that $v.id > s.\textit{client-cur}.id_p$,
and by the definition of $\mathcal{F}_{dvs}$ we have $t.\textit{current-viewid}[p] = s.\textit{client-cur}.id_p$.

(b) $v \notin t.created$.

In this case we set $\alpha = (t, \pi', t'', \pi'', t')$, where $\pi' =$ dvs-createview$(v)$, $\pi'' =$
dvs-newview$(v)_p$, and $t''$ is the unique state that arises by running the effect of $\pi'$ from $t$.
The code shows that $\alpha$ brings DVS-SYS from state $t$ to state $t'$. It remains to prove that
$\pi'$ is enabled in $t$ and that $\pi''$ is enabled in $t''$.

The precondition of $\pi'$ requires that (i) $\forall w \in t.created$, $v.id \neq w.id$ and (ii) $\forall w \in
t.created$, either $\exists x \in s.TotAtt$ satisfying $w.id < x.id < v.id$ or $v.id < x.id < w.id$, or else
$v.set \cap w.set \neq \emptyset$.

To see requirement (i), suppose for the sake of contradiction that $w \in t.created$ and $w.id =
v.id$. The precondition of $\pi$ in DVS-IMPL implies that $v = s.cur_p$, which implies that
$v \in s.created$. Since $w \in t.created$, the definition of $\mathcal{F}_{dvs}$ implies that $w \in s.attempted_q$
for some $q$. This implies that $w \in s.created$. But then Invariant 4.1.1 implies that $v = w$.
But this contradicts the fact that $v \notin t.created$ and $w \in t.created$.

To see requirement (ii), suppose that $w \in t.created$ and there is no $x \in s.TotAtt$ satisfying
$w.id < x.id < v.id$ or $v.id < x.id < w.id$. Since $w \in t.created$, by definition of $\mathcal{F}_{dvs}$,
$w \in s.attempted_q$ for some $q$. Clearly, $w \in s'.attempted_q$. Therefore, $w \in s'.Att$. By the
code of $\pi$ we have that $v \in s'.attempted_p$. Therefore we also have $v \in s'.Att$. Moreover,
there is no $x \in s'.TotAtt$ satisfying $w.id < x.id < v.id$ or $v.id < x.id < w.id$. Then
Invariant 5.2.17 implies that $v.set \cap w.set \neq \emptyset$, as needed to prove that $\pi'$ is enabled in $t$.

We now prove that $\pi''$ is enabled in state $t''$. The precondition of $\pi''$ requires that
$v \in t''.created$ and $v.id > t''.\textit{current-viewid}[p]$. The first condition is true because $v$
is added to $created$ by $\pi'$. The second condition follows from the precondition of $\pi$ in
DVS-IMPL: The precondition of $\pi$ implies that $v.id > s.\textit{client-cur}.id_p$. The definition of
$\mathcal{F}_{dvs}$ implies that $t.\textit{current-viewid}[p] = s.\textit{client-cur}.id_p$. Moreover, $t''.\textit{current-viewid}[p] =
t.\textit{current-viewid}[p]$. It follows that $v.id > t''.\textit{current-viewid}[p]$. Thus $\pi''$ is enabled in
state $t''$.

11. $\pi =$ dvs-register$_p$

Then $trace((s, \pi, s')) = \pi$. Let $g$ be $s.\textit{client-cur}.id_p$, which equals $t.\textit{current-viewid}[p]$ by the
abstraction function. If $g = \bot$, then $\pi$ has no effect in DVS-IMPL, so $s = s'$; thus $t = t'$, as
required to show that $\pi$ brings DVS from $t$ to $t'$. Otherwise, $g \neq \bot$, so by the code in
DVS-IMPL, this action sets $reg[g]_p$ to true and inserts a "registered" message into $\textit{msgs-to-vs}[g]_p$.
By definition of $\mathcal{F}_{dvs}$, $t'$ is the same as $t$ except that $t'.registered[g] = t.registered[g] \cup \{p\}$. We
set $\alpha = (t, \text{dvs-register}_p, t')$. It is easy to check that dvs-register$_p$ brings DVS-SYS from $t$ to $t'$.

12. $\pi =$ dvs-garbagecollect$(v)_p$

Then $trace((s, \pi, s')) = \lambda$. This action can modify $act_p$ and $amb_p$. The definition of $\mathcal{F}_{dvs}$ is
not sensitive to these changes. Therefore, $t = t'$, and we set $\alpha = t$.
13. $\pi =$ dvs-gpsnd$(m)_p$

Then $trace((s, \pi, s')) = \pi$. We set $\alpha = (t, \text{dvs-gpsnd}(m)_p, t')$. We consider two cases:

(a) $s.\textit{client-cur}.id_p = \bot$

Then $s = s'$. In this case, the definition of $\mathcal{F}_{dvs}$ implies that also $t.\textit{current-viewid}[p] = \bot$,
which implies that the action also has no effect in $t$, which suffices.

(b) $s.\textit{client-cur}.id_p \neq \bot$

In this case, the action appends $m$ to $\textit{msgs-to-vs}[g]_p$, where $g = \textit{client-cur}.id_p$. Hence
we have that $s'.\textit{msgs-to-vs}[g]_p = s.\textit{msgs-to-vs}[g]_p \circ m$. By the definition of $\mathcal{F}_{dvs}$ we get that
$t'.pending[p, g] = t.pending[p, g] \circ m$. This is the effect of the action in $t$ (using the fact
that $t.\textit{current-viewid}[p] \neq \bot$).

14. $\pi =$ dvs-gprcv$(m)_p$

Then $trace((s, \pi, s')) = \pi$. This action removes the head of $\textit{msgs-from-vs}[g]_p$, where $g =
cur.id_p$. We have that $s.\textit{msgs-from-vs}[g]_p = m \circ s'.\textit{msgs-from-vs}[g]_p$. Thus $t'.next[p, g] =
t.next[p, g] + 1$. We set $\alpha = (t, \text{dvs-gprcv}(m)_p, t')$. It is easy to check that the step has the
required effect in DVS-SYS. The fact that dvs-gprcv$(m)_p$ is enabled in $t$ follows from
Invariant 5.2.19.

15. $\pi =$ dvs-safe$(m)_p$

Then $trace((s, \pi, s')) = \pi$. This action removes the head of $\textit{safe-from-vs}[g]_p$, where $g = cur.id_p$.
We have that $s.\textit{safe-from-vs}[g]_p = m \circ s'.\textit{safe-from-vs}[g]_p$. Thus $t'.\textit{next-safe}[p, g] = t.\textit{next-safe}[p, g] + 1$.
We set $\alpha = (t, \text{dvs-safe}(m)_p, t')$. It is easy to check that the step has the required effect in
DVS-SYS. The fact that dvs-safe$(m)_p$ is enabled in $t$ follows from Invariant 5.2.21.

□
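
The shape of the case analysis can be summarized as a small lookup: each DVS-IMPL action is matched by zero, one, or two DVS-SYS actions, and the externally visible part is preserved. The sketch below is only an illustration of that bookkeeping. The string action names, the function `abstract_fragment`, and its two boolean parameters are assumptions made for this example, and the real proof of course also has to check enabledness and state effects.

```python
# Actions that are external in both DVS-IMPL and DVS-SYS.
EXTERNAL = {"dvs-newview", "dvs-register", "dvs-gpsnd", "dvs-gprcv", "dvs-safe"}


def trace(actions):
    """Externally observable part of a sequence of action names."""
    return [a for a in actions if a in EXTERNAL]


def abstract_fragment(action, is_client_msg=True, v_in_created=True):
    """DVS-SYS actions that simulate one DVS-IMPL action (cases 1-15)."""
    if action == "vs-order":
        # Case 4: only client messages are visible to DVS.
        return ["dvs-order"] if is_client_msg else []
    if action == "dvs-newview":
        # Case 10: create the view first if DVS has not seen it yet.
        return ["dvs-newview"] if v_in_created else ["dvs-createview", "dvs-newview"]
    if action.startswith("vs-") or action == "dvs-garbagecollect":
        # Cases 1-3, 5-9, 12: the abstract state does not change.
        return []
    # Cases 11, 13-15: the same action is taken in DVS-SYS.
    return [action]


IMPL_ACTIONS = [
    "vs-createview", "vs-newview", "vs-gpsnd", "vs-order", "vs-gprcv", "vs-safe",
    "dvs-newview", "dvs-register", "dvs-garbagecollect",
    "dvs-gpsnd", "dvs-gprcv", "dvs-safe",
]
```

For every entry of `IMPL_ACTIONS` and every combination of the two flags, `trace([a])` equals `trace(abstract_fragment(a, ...))`, mirroring the requirement $trace(\alpha) = trace(\pi)$.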

Lemmas 5.2.22 and 5.2.23 prove that $\mathcal{F}_{dvs}$ is an abstraction function from DVS-IMPL to DVS and
thus the following theorem holds.

Theorem 5.2.24 Every trace of DVS-IMPL is a trace of DVS-SYS.

5.3 An application of DVS

In this section we show how to use DVS to implement a totally ordered broadcast service, called TO.
In Section 5.3.1 we give the specification of the totally ordered broadcast service TO, in Section 5.3.2
we describe the implementation, which we call TO-IMPL, and in Section 5.3.3 we prove that TO-IMPL
implements TO.
5.3.1 The TO service

The TO service was originally defined in [41]. This service accepts messages from clients and delivers
them to all clients according to the same total order. This kind of service is a building block for
many fault-tolerant distributed applications. The specification is reproduced in Figure 5-5.

The following is an informal description of the service. Processes can broadcast messages by
means of actions bcast$(a)_p$. Such a message $a$ is appended to a queue local to process $p$, $pending[p]$.
The service establishes a total order on the messages by means of action to-order$(a, p)$, which takes
a message from the local queue of a process and puts it into a global queue. The order established
by this global queue is the one used to deliver messages. The pointer $next[q]$ points to the next
message in the global queue to be delivered to process $q$ by means of action brcv$(a)_{p,q}$.

TO

Signature:

Input:    bcast$(a)_p$, $a \in A$, $p \in \mathcal{P}$
Internal: to-order$(a, p)$, $a \in A$, $p \in \mathcal{P}$
Output:   brcv$(a)_{p,q}$, $a \in A$, $p, q \in \mathcal{P}$

State:

$queue \in seqof(A \times \mathcal{P})$, init $\lambda$
for each $p \in \mathcal{P}$:
    $pending[p] \in seqof(A)$, init $\lambda$
    $next[p] \in \mathbb{N}^{>0}$, init 1

Transitions:

input bcast$(a)_p$
    Eff: append $a$ to $pending[p]$

internal to-order$(a, p)$
    Pre: $a$ is head of $pending[p]$
    Eff: remove head of $pending[p]$
         append $(a, p)$ to $queue$

output brcv$(a)_{p,q}$
    Pre: $queue(next[q]) = (a, p)$
    Eff: $next[q] := next[q] + 1$

Figure 5-5: The TO service.
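
To make the specification concrete, here is a minimal executable sketch of the TO automaton. The class name `TOService`, the method names, and the convention that a disabled action raises an error or returns `None` are choices made for this example rather than part of the specification.

```python
class TOService:
    """Sketch of the TO specification: one global queue, per-process pointers."""

    def __init__(self, processes):
        self.queue = []                              # sequence of (a, p) pairs
        self.pending = {p: [] for p in processes}    # per-process input queues
        self.next = {p: 1 for p in processes}        # 1-indexed, as in the figure

    def bcast(self, a, p):
        """input bcast(a)_p: append a to pending[p]."""
        self.pending[p].append(a)

    def to_order(self, p):
        """internal to-order(a, p): move the head of pending[p] to the queue."""
        if not self.pending[p]:
            raise ValueError("to-order is not enabled: pending[p] is empty")
        a = self.pending[p].pop(0)
        self.queue.append((a, p))

    def brcv(self, q):
        """output brcv(a)_{p,q}: deliver queue(next[q]) to q, or None if disabled."""
        if self.next[q] > len(self.queue):
            return None
        a, p = self.queue[self.next[q] - 1]
        self.next[q] += 1
        return (a, p)


to = TOService(["p1", "p2"])
to.bcast("x", "p1")
to.bcast("y", "p2")
to.to_order("p2")      # the service happens to order y before x
to.to_order("p1")
seen_by_p1 = [to.brcv("p1"), to.brcv("p1")]
seen_by_p2 = [to.brcv("p2"), to.brcv("p2")]
```

Both processes see the same sequence, regardless of who broadcast first, because delivery reads only from the single global queue.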

5.3.2 The implementation of TO 

We provide an implementation of TO using DVS as a building block. The implementation, which we
call TO-IMPL, consists of an automaton DVS-TO-TO$_p$ for each $p \in \mathcal{P}$, and the DVS specification.

The implementation is similar to the TO implementation provided in [41]. Both algorithms rely on 
primary views to establish a total order of client messages. The main difference is that the algorithm 
in [41] uses a static notion of primary and the new one uses a dynamic notion. The algorithm of 
[41] is built upon a VS service that reports non-primary as well as primary views, and uses a simple 
local test to determine if the view is primary. That algorithm does some non-critical background 
work (gossiping information) in non-primary views. In contrast, the algorithm proposed here is built 
upon the DVS service, which only reports primary views. Thus the new algorithm is simpler in that 
it does not perform the local tests and does not carry out any processing in non-primary views. On 
the other hand, in the new algorithm, the application programs must perform dvs-register actions to 
tell the dvs service when they have "established" new views. Although the new algorithm appears 
very similar to the one of [41], the fact that the DVS service provides weaker and more complicated
guarantees than the vs service makes the new algorithm harder to prove correct. 

The TO-IMPL algorithm involves normal and recovery activity. Normal activity occurs while a 
group view is not changing. Recovery activity begins when a new primary view is presented by 
DVS, and continues while the members combine information from their previous history, to provide 
a consistent basis for ongoing normal activity. 

During normal activity, each client message received by TO-IMPL is given a system-wide unique 
label, which consists of a view identifier (the one of the view in which the message is received), a 
sequence number and the identifier of the process receiving the message. The association between 
client messages and their unique labels is recorded in a relation content and communicated to other 
processes in the same view using dvs. When a message is received, the label is given an order, a 
tentative position in the system-wide total order the service is to provide. When client messages 
have been reported as delivered to all the members of the view, by the "safe" notification of dvs, 
the label and its order may become confirmed. The messages associated with confirmed labels may 
be released to the clients in the given order. 
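
The normal-activity path can be sketched as follows. This is an illustrative simplification and not the DVS-TO-TO code itself: the class names `Label` and `Node`, the use of a dictionary for the content relation, and the return conventions are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Label:
    """System-wide unique label: (view id, sequence number, origin process)."""
    view_id: int
    seqno: int
    origin: str


class Node:
    def __init__(self, pid, view_id):
        self.pid, self.view_id = pid, view_id
        self.nextseqno = 1
        self.content = {}         # label -> client message (a relation in the text)
        self.order = []           # tentative total order of labels
        self.safe_labels = set()
        self.nextconfirm = 1      # 1-indexed pointers, as in the text
        self.nextreport = 1

    def label(self, a):
        """Give client message a a fresh label in the current view."""
        l = Label(self.view_id, self.nextseqno, self.pid)
        self.content[l] = a
        self.nextseqno += 1
        return l

    def gprcv(self, l, a):
        """A labelled message arrives: record it and give it a tentative position."""
        self.content[l] = a
        self.order.append(l)

    def safe(self, l):
        """The group service reports that l reached all members of the view."""
        self.safe_labels.add(l)

    def confirm(self):
        """Advance nextconfirm if the next label in order is safe."""
        if (self.nextconfirm <= len(self.order)
                and self.order[self.nextconfirm - 1] in self.safe_labels):
            self.nextconfirm += 1
            return True
        return False

    def release(self):
        """Release the next confirmed message to the client, if there is one."""
        if self.nextreport < self.nextconfirm:
            l = self.order[self.nextreport - 1]
            self.nextreport += 1
            return (self.content[l], l.origin)
        return None


node = Node("p", view_id=1)
l1 = node.label("x")
node.gprcv(l1, "x")
before_safe = (node.confirm(), node.release())
node.safe(l1)
after_safe = (node.confirm(), node.release())
```

A message is released only after its label has been both ordered and reported safe, which is why `before_safe` shows no progress and `after_safe` shows the message being delivered.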

The consistent sequence of message delivery within each view keeps this tentative order consistent
at members of a given view, but it may not be consistent between nodes in different views. To avoid
inconsistencies, processes need a state exchange at the beginning of a new view.

When a new primary view is reported by DVS, recovery activity occurs to integrate the knowledge 
of different members. First, each member of a new view sends a message, using dvs, that contains 
a summary of that node's state. The summary of a node's state contains the following information: 
the association of labels with client messages, stored in content, the order of client messages to be 
reported to the clients, stored in order, a pointer to the next client message to be confirmed, stored 
in nextconfirm and the view identifier of the primary view with the highest view identifier in which 
the order sequence has been modified (stored in highprimary). 

Once a node has received all members' state summaries, it processes the information in one
atomic step, i.e., it establishes the new view. Once a node establishes a view, it informs DVS of
that fact with a dvs-register action. The node processes state information as follows: it defines
its confirmed labels to be the longest prefix of confirmed labels known in any of the summaries;
it determines the representatives as the members whose summaries include the greatest highprimary
value; and it adopts as its new order the order of a "chosen" representative (the chosen representative is
arbitrary but must be the same for all processes) extended with all other labels appearing in any of
the received summaries, arranged in label order.

Then recovery continues by collecting the dvs safe indications. Once the state exchange is safe, 
all labels used in the exchange are marked as safe and all associated messages are confirmed just as 
in normal processing. 
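DVS-TO-TO$_p$

Signature:

Input:    bcast$(a)_p$, $a \in A$
          dvs-gprcv$(m)_{q,p}$, $q \in \mathcal{P}$, $m \in \mathcal{C} \cup \mathcal{S}$
          dvs-safe$(m)_{q,p}$, $q \in \mathcal{P}$, $m \in \mathcal{C} \cup \mathcal{S}$
          dvs-newview$(v)_p$, $v \in \mathcal{V}$
Output:   dvs-gpsnd$(m)_p$, $m \in \mathcal{C} \cup \mathcal{S}$
          dvs-register$_p$
          brcv$(a)_{q,p}$, $a \in A$, $q \in \mathcal{P}$
Internal: label$(a)_p$, $a \in A$
          confirm$_p$

State:

$current \in \mathcal{V}_\bot$, init $v_0$ if $p \in P_0$, $\bot$ else
$status \in \{normal, send, collect\}$, init $normal$
$content \in 2^{\mathcal{C}}$, init $\{\}$
$nextseqno \in \mathbb{N}^{>0}$, init 1
$buffer \in seqof(\mathcal{L})$, init $\lambda$
$\textit{safe-labels} \in 2^{\mathcal{L}}$, init $\{\}$
$order \in seqof(\mathcal{L})$, init $\lambda$
$nextconfirm \in \mathbb{N}^{>0}$, init 1
$nextreport \in \mathbb{N}^{>0}$, init 1
$highprimary \in \mathcal{G}$, init $g_0$
$gotstate$, a partial function from $\mathcal{P}$ to $\mathcal{S}$, init $\{\}$
$\textit{safe-exch} \subseteq \mathcal{P}$, init $\{\}$
$registered \subseteq \mathcal{G}$, init $\{g_0\}$ if $p \in P_0$, $\{\}$ else
$delay \in seqof(A)$, init $\lambda$
$established[g]$, a bool, init true if $g = g_0$, $p \in P_0$, false else

Figure 5-6: The DVS-TO-TO$_p$ code.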

For the code, shown in Figures 5-6 and 5-7, we need the following definitions. $\mathcal{L} = \mathcal{G} \times \mathbb{N}^{>0} \times \mathcal{P}$
is the set of labels, with selectors $l.id$, $l.seqno$ and $l.origin$. $A$ is the set of messages that can be
sent by the clients of the TO service. $\mathcal{C} = \mathcal{L} \times A$ is the set of possible associations between labels
and client messages. $\mathcal{S} = 2^{\mathcal{C}} \times seqof(\mathcal{L}) \times \mathbb{N}^{>0} \times \mathcal{G}$ is the set of summaries, with selectors $x.con$,
$x.ord$, $x.next$ and $x.high$. Given $x \in \mathcal{S}$, $x.confirm$ is the prefix of $x.ord$ such that $|x.confirm| =
\min(x.next - 1, |x.ord|)$. If $Y$ is a partial function from processor ids to summaries, then we define:

$knowncontent(Y) = \bigcup_{q \in dom(Y)} Y(q).con$,
$maxprimary(Y) = \max_{q \in dom(Y)} \{Y(q).high\}$,
$maxnextconfirm(Y) = \max_{q \in dom(Y)} \{Y(q).next\}$,
$reps(Y) = \{q \in dom(Y) : Y(q).high = maxprimary(Y)\}$,
$chosenrep(Y) =$ some element in $reps(Y)$,
$shortorder(Y) = Y(chosenrep(Y)).ord$, and
$fullorder(Y) = shortorder(Y)$ followed by the remaining elements of $dom(knowncontent(Y))$, in
label order.
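
These definitions translate almost directly into code. The sketch below makes three assumptions that are not fixed by the text: labels are tuples compared lexicographically (standing in for "label order"), processor ids are strings, and `chosenrep` picks the smallest id in `reps(Y)` as one deterministic way to make all processes agree on the same representative.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set, Tuple

Label = Tuple[int, int, str]          # (view id, sequence number, origin)


class Summary(NamedTuple):
    con: FrozenSet[Tuple[Label, str]]  # label/message associations
    ord: Tuple[Label, ...]             # order of labels
    next: int                          # next label to confirm (1-indexed)
    high: int                          # highest primary view id


def knowncontent(Y: Dict[str, Summary]) -> Set[Tuple[Label, str]]:
    return set().union(*(x.con for x in Y.values()))


def maxprimary(Y: Dict[str, Summary]) -> int:
    return max(x.high for x in Y.values())


def maxnextconfirm(Y: Dict[str, Summary]) -> int:
    return max(x.next for x in Y.values())


def reps(Y: Dict[str, Summary]) -> Set[str]:
    hp = maxprimary(Y)
    return {q for q, x in Y.items() if x.high == hp}


def chosenrep(Y: Dict[str, Summary]) -> str:
    # Assumption: any rule works as long as every process applies the same one.
    return min(reps(Y))


def shortorder(Y: Dict[str, Summary]) -> List[Label]:
    return list(Y[chosenrep(Y)].ord)


def fullorder(Y: Dict[str, Summary]) -> List[Label]:
    short = shortorder(Y)
    remaining = {l for (l, _) in knowncontent(Y)} - set(short)
    return short + sorted(remaining)   # remaining labels in (assumed) label order


l1, l2, l3 = (1, 1, "p"), (2, 1, "p"), (1, 1, "q")
Y = {
    "p": Summary(frozenset({(l1, "a"), (l2, "b")}), (l1,), next=2, high=1),
    "q": Summary(frozenset({(l1, "a"), (l3, "c")}), (l1, l3), next=1, high=2),
}
```

In this example `q` is the only representative because its summary carries the larger `high` value, so its order `(l1, l3)` is adopted and the label `l2`, known only from `p`'s summary, is appended afterwards.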

The system TO-IMPL is the composition of all the DVS-TO-TO$_p$ automata and DVS with all the
external actions of DVS hidden.

The $allstate$, $allcontent$ and $allconfirm$ derived variables are defined for TO-IMPL as follows (this
is as in [41]).

We write $allstate[p, g]$ to denote a set of summaries, defined so that $x \in allstate[p, g]$ if and only
if at least one of the following holds:

1. $current.id_p = g$ and $x = (content_p, order_p, nextconfirm_p, highprimary_p)$.

2. $x \in pending[p, g]$.

3. $(x, p) \in queue[g]$.
DVS-TO-TO$_p$ (cont'd)

Transitions:

input bcast$(a)_p$
    Eff: append $a$ to $delay$

internal label$(a)_p$
    Pre: $a$ is head of $delay$
         $current \neq \bot$
    Eff: let $l$ be $(current.id, nextseqno, p)$
         $content := content \cup \{(l, a)\}$
         append $l$ to $buffer$
         $nextseqno := nextseqno + 1$
         delete head of $delay$

output dvs-gpsnd$(\langle l, a \rangle)_p$
    Pre: $status = normal$
         $l$ is head of $buffer$
         $(l, a) \in content$
    Eff: delete head of $buffer$

input dvs-gprcv$(\langle l, a \rangle)_{q,p}$
    Eff: $content := content \cup \{(l, a)\}$
         $order := order \circ l$

input dvs-safe$(\langle l, a \rangle)_{q,p}$
    Eff: $\textit{safe-labels} := \textit{safe-labels} \cup \{l\}$

internal confirm$_p$
    Pre: $order(nextconfirm) \in \textit{safe-labels}$
    Eff: $nextconfirm := nextconfirm + 1$

output brcv$(a)_{q,p}$
    Pre: $nextreport < nextconfirm$
         $(order(nextreport), a) \in content$
         $q = order(nextreport).origin$
    Eff: $nextreport := nextreport + 1$

input dvs-newview$(v)_p$
    Eff: $current := v$
         $nextseqno := 1$
         $buffer := \lambda$
         $gotstate := \{\}$
         $\textit{safe-exch} := \{\}$
         $\textit{safe-labels} := \{\}$
         $status := send$

output dvs-gpsnd$(x)_p$
    Pre: $status = send$
         $x = (content, order, nextconfirm, highprimary)$
    Eff: $status := collect$

input dvs-gprcv$(x)_{q,p}$
    Eff: $content := content \cup x.con$
         $gotstate := gotstate \oplus (q, x)$
         if $(dom(gotstate) = current.set) \wedge (status = collect)$ then
             $nextconfirm := maxnextconfirm(gotstate)$
             $order := fullorder(gotstate)$
             $highprimary := current.id$
             $status := normal$
             $established[current.id] := true$

output dvs-register$_p$
    Pre: $current \neq \bot$
         $established[current.id]$
         $current.id \notin registered$
    Eff: $registered := registered \cup \{current.id\}$

input dvs-safe$(x)_{q,p}$
    Eff: $\textit{safe-exch} := \textit{safe-exch} \cup \{q\}$
         if $\textit{safe-exch} = current.set$ then
             $\textit{safe-labels} := \textit{safe-labels} \cup range(fullorder(gotstate))$

Figure 5-7: The DVS-TO-TO$_p$ code (cont'd).

4. For some $q$, $current.id_q = g$ and $x = gotstate(p)_q$.

Thus, $allstate[p, g]$ consists of all the summary information that is in the state of $p$ if $p$'s current view is
$g$, plus all the summary information that has been sent out by $p$ in state exchange messages in view $g$
and is now remembered elsewhere among the state components of TO-IMPL. Notice that $allstate[p, g]$
consists only of summaries: an ordinary message $(l, a)$ is never an element of $allstate[p, g]$. We write
$allstate[g]$ to denote $\bigcup_{p \in \mathcal{P}} allstate[p, g]$, and $allstate$ to denote $\bigcup_{g \in \mathcal{G}} allstate[g]$.

We write $allcontent$ for $\bigcup_{x \in allstate} x.con \cup \{(l, a) : \exists g, p : ((l, a), p) \in range(queue[g]) \vee (l, a) \in
range(pending[p, g])\}$. This represents all the information available anywhere that links a label with
a corresponding data value.

We write $allconfirm$ for $lub_{x \in allstate}(x.confirm)$.

For every $p \in \mathcal{P}$, $g \in \mathcal{G}$, $buildorder[p, g]$ is defined to be a sequence of labels, initially empty; this
variable is maintained by following every statement of processor $p$ that assigns to $order$ with another
statement $buildorder[p, current.id_p] := order$. It follows that if $p$ establishes a view with id $g$, and
later leaves view $g$ for a view with a higher view identifier, then forever afterwards, $buildorder[p, g]$
remembers the value of $order_p$ at the point where $p$ left view $g$.

5.3.3 Proof that TO-IMPL implements TO

The correctness proof for TO-IMPL follows the outline of the one in [41]. The main difference is that
the main invariant, which corresponds to Lemma 6.18 of [41], requires a different, more subtle proof.
We first provide some auxiliary invariants.

Invariant 5.3.1 (TO-IMPL)

In any reachable state, if $p \in DVS.registered[g]$ then $established[g]_p$.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, $DVS.registered[g]$ is empty for all $g$ except
for $g = g_0$, for which $DVS.registered[g_0] = P_0$. So assume that $g = g_0$ and $p \in P_0$. In the initial state
$established[g]_p = true$ if $g = g_0$ and $p \in P_0$. Hence the invariant is true.

For the inductive step, assume that the invariant is true in a reachable state $s$. We need to prove
that it is true in $s'$ for any possible step $(s, \pi, s')$. Fix $g$ and $p$. The hypothesis changes from false
to true only when $\pi =$ dvs-register$_p$ and $s.current.id_p = g$, and the action's precondition (in DVS-TO-TO)
shows the conclusion. The conclusion never changes from true to false. □

Invariant 5.3.2 says that any view that is known (anywhere in the system state) to be an established
primary was in fact attempted by all its members.

Invariant 5.3.2 (TO-IMPL)

In any reachable state, if $x \in allstate$ then there exists $w \in created$ such that $x.high = w.id$, and for
all $p \in w.set$, $p \in attempted[w.id]$.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, the invariant follows from the definition of
$allstate$ (set $w = current_p$).

For the inductive step, assume that the invariant is true in a reachable state $s$. We need to prove
that it is true in $s'$ for any possible step $(s, \pi, s')$. The only step that we have to worry about is
when a new summary is created. When a new summary $x$ is created, $x.high$ is set to the identifier
of the current view, and a message has been received from everyone in the membership. □

Invariant 5.3.3 says that if a view w is established, then no earlier view v can still be active. 

Invariant 5.3.3 (to-impl) 

In any reachable state, if v ∈ created, x ∈ allstate and x.high > v.id then there exists p ∈ v.set with
current.id_p > v.id.

Proof: Fix v, x as given. Invariant 5.3.2 shows the existence of w ∈ created such that x.high = w.id,
and for all p ∈ w.set, p ∈ attempted[w.id]. Then Invariant 5.1.4 implies that there exists p ∈ v.set
with current-viewid[p] > v.id. But current-viewid[p] = current.id_p, which yields the result. □

Finally we provide the proof for the invariant corresponding to the invariant stated in Lemma 
6.18 of [41]. This invariant has a more subtle proof than the one given in Lemma 6.18 of [41]. That 
proof uses the strong intersection property among primary view membership (in the implementation 
of [41] each primary view intersects each other primary view). The proof in [41] does not work in the 
setting of DVS because DVS guarantees a weaker intersection property (each primary view intersects 
only the primary views in between the preceding and the following totally registered primary views). 
The new proof also uses the fact about DVS that once a view is attempted at all processes in its
membership set, no views with lower identifiers can become established.

Invariant 5.3.4 (to-impl)

In any reachable state, suppose that v ∈ created, a ∈ seqof(L), and for every p ∈ v.set, the following
is true: If current.id_p > v.id then established[v.id]_p and a < buildorder[p, v.id].

Then for every x ∈ allstate with x.high > v.id, we have that a < x.ord.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state, the only created view is v_0, and there is no
x ∈ allstate with x.high > g_0. So the statement is vacuously true.

For the inductive step, assume that the invariant is true in a reachable state s. We need to prove
that it is true in s' for any possible step (s, π, s'). So fix v ∈ s'.created and a, and assume that for
every p ∈ v.set, if s'.current.id_p > v.id then s'.established[v.id]_p and a < s'.buildorder[p, v.id].

If v ∉ s.created, then π must be createview(v). Then s'.established[v.id]_p = false for all p. Fix
x ∈ s'.allstate and suppose that x.high > v.id. Then Invariant 5.3.3 applied to s' implies that there
exists p ∈ v.set with s'.current.id_p > v.id; fix such a p. Then the hypothesis part of the invariant
for s' implies that s'.established[v.id]_p = true, a contradiction. It follows that v ∈ s.created.

As usual, the interesting steps are those that convert the hypothesis from false to true, and those 
that keep the hypothesis true while converting the conclusion from true to false. 

In this case, there are no steps that convert the hypothesis from false to true: If there is some
p ∈ v.set for which s.current.id_p > v.id and either s.established[v.id]_p = false or a is not a
prefix of s.buildorder[p, v.id], then also s'.current.id_p > v.id (the id never decreases) and either
s'.established[v.id]_p = false or a is not a prefix of s'.buildorder[p, v.id]. (These two cases carry over,
since s.current.id_p > v.id implies that established[v.id]_p and buildorder[p, v.id] cannot change.)

So it remains to consider any steps that keep the hypothesis true while converting the conclusion
from true to false. Thus, we assume that if s.current.id_p > v.id then s.established[v.id]_p and
a < s.buildorder[p, v.id]. Suppose that x ∈ s'.allstate and x.high > v.id. If also x ∈ s.allstate then
we can apply the inductive hypothesis, which implies that a < x.ord, as needed. So the only concern
is with steps that produce a new summary.

Any step that produces the new summary x by modifying an old summary x' ∈ s.allstate, in such
a way that x'.ord < x.ord and x'.high = x.high, is easy to handle: For such a step, x'.high > v.id
and so the inductive hypothesis implies that a < x'.ord < x.ord, as needed. So the only concern is
with gprcv_p steps that produce a new summary x by delivering the last state-exchange message of
a view w to some processor p. Thus x.high = w.id. Let x' be the summary of q' = chosenrep in
s'.gotstate. We claim x'.high > v.id.

To prove the claim, we let v' denote the unique element with highest viewid among the elements
of s'.created such that v'.id < w.id and s'.registered[v'.id] = v'.set. Let v'' denote either v' or v,
whichever has the higher viewid. Invariant 5.1.3 shows that w.set ∩ v''.set ≠ ∅, no matter whether
v'' is v or v'. Fix any element q'' in w.set ∩ v''.set.

Recall that the condition for establishing a view shows that domain(s'.gotstate_p) = w.set, so by
the code, either q'' ∈ domain(s.gotstate_p) or else q'' is the sender of the message whose receipt is
the step we are examining. In the former case, let x'' be the summary s.gotstate(q'')_p; in the latter
let x'' be the summary whose receipt is the event. In either case we have x'' ∈ s.allstate[q'', w.id].

We now show that s.established[v''.id]_{q''}. We consider two cases:

1. v'' = v'.

Then q'' ∈ v'.set so by definition of v', we obtain q'' ∈ s.registered[v'.id]; therefore, we have
that s.established[v'.id]_{q''}.

2. v'' = v.

Because s.allstate[q'', w.id] is non-empty, the analogue of Part 4 of Lemma 6.7 from
[41] implies that s.current.id_{q''} > w.id. We have that x.high > v.id by assumption, and
x.high = w.id by the code; therefore, w.id > v.id. So also s.current.id_{q''} > v.id. Recall that
we are in the case where the hypothesis of this lemma is true. Therefore, by this hypothesis
(which uses q'' ∈ v.set), we obtain that s.established[v.id]_{q''}.

By the analogue of Lemma 6.14 from [41] (applied with q'' replacing p) we obtain x''.high > v''.id.
By the definition of q' as a member that maximizes the high component in the summary recorded
in s'.gotstate, we have x'.high > x''.high. Therefore x'.high > v''.id > v.id, completing our proof of
the claim.

If x'.high > v.id then we can apply the induction hypothesis to x' and we are done, since x'.ord <
x.ord. So suppose x'.high = v.id. Note that x' ∈ s.allstate[q', w.id]. By an analogue of Lemma
6.16 from [41] there must exist¹ q ∈ v.set so that s.established[v.id]_q, x'.ord = s.buildorder[q, v.id],
and (either x'.high = w.id or s.current.id_q > v.id). Since x'.high = v.id < x.high = w.id, the
last property can be simplified to s.current.id_q > v.id. By monotonicity of current, we have
s'.current.id_q > v.id. The hypothesis of this lemma says that this forces a < s'.buildorder[q, v.id].
Since x'.ord < x.ord by the code for this event, and x'.ord = s.buildorder[q, v.id] as shown above,
and s.buildorder[q, v.id] = s'.buildorder[q, v.id] since q is not currently in view v, we get a < x.ord,
which is what we need. □

¹ Direct application of the lemma actually shows the existence of some view v̂ and q ∈ v̂.set, but since
x'.high = v̂.id and also x'.high = v.id, uniqueness of viewids shows we may take v̂ to be v itself.



Let s be a state of to-impl. The state t = F_TO(s) of TO is the following.

1. t.queue = applyall(⟨s.allcontent, origin⟩, s.allconfirm),
   where the selector origin is regarded as a function from labels to processors.

2. t.next[p] = s.next-report_p.

3. t.pending[p] = applyall(s.allcontent, b) · s.delay_p, where b is the sequence of labels such that

   (a) range(b) is the set of labels l such that l.origin = p, ⟨l, a⟩ ∈ s.allcontent for some a,
       and l ∉ range(s.allconfirm).

   (b) b is ordered according to the label order.

Figure 5-8: The abstraction function F_TO.

To complete the implementation proof, we define a function from the reachable states of to-impl
to the states of TO and prove that it is an abstraction function. This function is defined exactly as
in [41]. Figure 5-8 shows the abstraction function F_TO. The proof that F_TO is an abstraction function
is as in [41]. Since F_TO is an abstraction function we have the following theorem.

Theorem 5.3.5 Every trace of TO-IMPL is a trace of TO. 
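The shape of the abstraction function in Figure 5-8 can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the automaton code of this thesis or of [41]: labels are modelled as hypothetical (view id, sequence number, origin) tuples whose lexicographic order stands in for the label order, allcontent is a dictionary from labels to messages, and the helper names are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical label representation: (view id, sequence number, origin).
# Python compares tuples lexicographically; this stands in for the label order.
Label = Tuple[int, int, str]


def applyall(f, seq):
    """Apply f to every element of a sequence, preserving the order."""
    return [f(x) for x in seq]


def abstract_queue(allcontent: Dict[Label, str], allconfirm: List[Label]):
    """t.queue: every confirmed label mapped to the pair (message, origin)."""
    return applyall(lambda l: (allcontent[l], l[2]), allconfirm)


def abstract_pending(p: str, allcontent: Dict[Label, str],
                     allconfirm: List[Label], delay_p: List[str]):
    """t.pending[p]: p's labelled but unconfirmed messages in label order,
    followed by the messages still waiting in p's delay buffer."""
    confirmed = set(allconfirm)
    b = sorted(l for l in allcontent if l[2] == p and l not in confirmed)
    return applyall(lambda l: allcontent[l], b) + list(delay_p)
```

For instance, with two labelled messages from p of which only the first is confirmed, `abstract_queue` contains the first message and `abstract_pending("p", ...)` starts with the second.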

5.4 Remarks 

The safe indications provided by the DVS service are crucial to the application: commitments about 
the total order are made only when receiving safe notifications for particular messages. This is a 
common point for any application that needs to preserve strong data coherence. For such appli- 
cations, no commitments about the shared data can be made before safe indications are delivered. 
However the application can still perform some useful work while waiting for a safe indication. For 
example, it can pre-compute the value of the shared data so that when the safe indications arrive 
little processing will be needed (of course such computation is wasted if the safe indication never 
arrives); it can do optimistic updates to the shared data assuming that the safe indications will 
arrive (in this case roll back is required if the safe indications do not arrive). 

The total order service that we have developed in this chapter can also be built using a sequence
of executions of a consensus algorithm (e.g., the multipaxos algorithm of [61]²). The advantage
of the approach taken here is that we use a building block, the DVS group communication service,
which may also be used for other applications. For example, in the next chapter, we will use a
variant of DVS to build two applications on top of it, an atomic multi-reader multi-writer shared
register and a dynamic version of the PAXOS algorithm.

This work deals entirely with safety properties; it remains to consider performance and fault-
tolerance properties as well. Future work also includes investigation of other applications of our DVS
specification, such as replicated data applications and load-balancing applications.

Another interesting direction to explore considers variations on the DVS specification, for example,
one in which the state exchange at the beginning of a new view is supported by the dynamic
view service. Also, one could provide variations on our specifications that are more specifically
tuned to systems like Isis and Ensemble. For example, a variation could require that processes that
move together from one view to the next receive exactly the same messages in the first view. This
guarantee is offered by Isis and Ensemble.

In the next chapter we provide a generalization of the DVS service to configurations. This
generalization will also include support for state transfer at the beginning of a new configuration.

² The name multipaxos is actually used in [29]. The original paper by Lamport [61] uses a different
name (multi-decree parliament protocol).

Chapter 6



The DC service 



In this chapter we generalize the notion of dynamic primary view to that of dynamic primary
configuration. We present DC, a specification for a dynamic primary configuration group communication
service. Section 6.2 provides the DC specification, Section 6.4 provides an implementation of DC,
and finally Section 6.5 describes an application that uses DC as a building block.

6.1 Overview 

The DC specification is similar to the DVS specification; the difference is that it provides the clients
with configurations instead of views. Like DVS, the DC specification is dynamic and provides primary
configurations. The main difficulty here is that a notion of dynamic primary configuration needs to
be developed (the notion of dynamic primary view has been studied in several papers, e.g. [55, 89]).
In this chapter we develop such a notion and we define the DC service, which provides dynamic
primary configurations to its clients.

Primary configurations must satisfy certain intersection properties with previous primary
configurations. The type of configurations that we consider in this chapter is the read-write-quorum
configuration (see Section 3.1.2). The intersection property that we require is that the membership
set of a new primary configuration must include the members of at least one read quorum
and one write quorum of the previous primary configuration. The DC specification provides to the
client only configurations satisfying this property.
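The intersection requirement just stated can be sketched in Python as follows. This is a minimal sketch, not part of the DC specification: the `Configuration` class and the function name are assumptions made for the example, with processes represented as strings and quorums as frozen sets.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Configuration:
    """Hypothetical read-write-quorum configuration."""
    id: int
    members: FrozenSet[str]               # the membership set
    rqrms: Tuple[FrozenSet[str], ...]     # read quorums
    wqrms: Tuple[FrozenSet[str], ...]     # write quorums


def includes_rw_quorums(prev: Configuration, new_members: FrozenSet[str]) -> bool:
    """True if new_members contains at least one read quorum and at least
    one write quorum of the previous configuration."""
    has_read = any(r <= new_members for r in prev.rqrms)
    has_write = any(w <= new_members for w in prev.wqrms)
    return has_read and has_write
```

Requiring a read quorum R and a write quorum W with R ∪ W contained in the new membership set is the same as requiring each of them separately, which is why the two tests can be made independently.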

Change of configurations might be driven either by change in the underlying physical distributed 
system or by the applications running on top of the system (e.g., a new quorum system could be 
installed on the same membership set). 

An important feature of the DC specification is that it incorporates a state-exchange at the
beginning of a new primary configuration. State-exchange at the beginning of a new configuration is
required by most applications. When a new configuration is issued, each member of the configuration
is supposed to submit its current state to the service. Once it has obtained the state from all
the members of the configuration, the DC service computes the most up-to-date state over all
the members, called the starting state. The starting state is then delivered to each member of
the configuration. This way, each member begins regular computation in the new configuration
knowing the starting state. We remark that this is different from the approach used by the DVS
service, which lets the members of the configuration compute the starting state. Some existing
group communication services also integrate state-exchange within the service, e.g., [82, 19, 86];
some others do not, e.g., [33, 73, 38, 41]. The Transis system [33] can be augmented with a layer
providing state-exchange [5].
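The state-exchange step can be illustrated with a short Python sketch. It is only an illustration under assumptions: client states are modelled as hypothetical (version, value) pairs, `latest_state` is one possible condenser function chosen for the example, and `None` plays the role of the undefined value ⊥.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

State = Tuple[int, str]  # hypothetical client state: (version, value)


def latest_state(states: Dict[str, State]) -> State:
    """Example condenser: pick the submitted state with the highest version."""
    return max(states.values(), key=lambda s: s[0])


def starting_state(members: Iterable[str],
                   got_state: Dict[str, Optional[State]],
                   condenser: Callable[[Dict[str, State]], State]) -> Optional[State]:
    """Return the starting state once every member has submitted its state,
    or None while some submission is still missing."""
    members = list(members)
    if any(got_state.get(q) is None for q in members):
        return None
    return condenser({q: got_state[q] for q in members})
```

The sketch makes the blocking nature of the exchange visible: as long as one member has not submitted its state, no starting state can be produced.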

The DC specification offers a broadcast/convergecast communication mechanism. This mecha- 
nism involves all the members of a quorum, and uses a condenser function to process the information 
gathered from the quorum [66]. More specifically, a client that wants to send a message to the mem- 
bers of its current configuration submits the message together with a condenser function to the 
service; then the DC service broadcasts the message to all the members of the configuration and 
waits for a response from a quorum (the type of the quorum, read or write, is also specified by the 
client); once answers are received from a quorum, the DC service applies the condenser function to 
these answers in order to compute a response to give back to the client that sent the message. Such 
a series of actions should be seen as performing an operation requested by the client; executing the 
operation requires the participation of a quorum of the processes. 
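The convergecast half of this mechanism can be sketched in Python. The function name, the dictionary representation of acknowledgments, and the use of `None` for a missing answer are assumptions of the example; the sketch only shows how a condenser function is applied to the answers of one quorum.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable, Optional


def convergecast(acks: Dict[str, Optional[Any]],
                 quorums: Iterable[FrozenSet[str]],
                 condenser: Callable[[Dict[str, Any]], Any]) -> Optional[Any]:
    """Apply the condenser to the acknowledgments of the first quorum whose
    members have all answered; return None if no quorum has fully answered."""
    for quorum in quorums:
        if all(acks.get(p) is not None for p in quorum):
            return condenser({p: acks[p] for p in quorum})
    return None
```

A client that reads a replicated value could, for example, pass the read quorums of its configuration together with a condenser that returns the largest acknowledged value.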

We remark that this kind of communication is different from those of the VS service [41] and
the DVS service. Instead, it is like the one used in [66]. We integrate it into DC because we want
to develop a particular application that benefits from this particular communication service (a
read/write register as is done in [66]).

6.2 The DC specification 

Prior to providing the code for the DC specification, we need some notation and definitions, which 
we introduce in the following. 

Let OID be a set of operation identifiers, partitioned into sets OID_p, p ∈ 𝒫. We denote by
ℳ_c ⊆ ℳ the set of messages that clients may use for communication.

Let 𝒜 be a set of "acknowledgment" values and let ℛ be a set of "response" values. A condenser
function is a function from (𝒫 → 𝒜_⊥) to ℛ. Let Φ be the set of all condenser functions. Let 𝒮
be the set of all possible states of the clients (a state in 𝒮 does not need to be the entire client's
state; it may contain only the information relevant for the application to work). The
DC specification uses a condenser function also to compute the starting state of a new configuration;
hence we assume that 𝒮 ⊆ 𝒜 and also 𝒮 ⊆ ℛ. Given a function f : 𝒫 → D from the set of processes
𝒫 to some domain D and given a subset P ⊆ 𝒫, we write f|P to denote the function f' : P → D,
defined as f'(p) = f(p) for p ∈ P.

The following data type is used to describe operations: 𝒟 = ℳ × Φ × {"read", "write"} × 2^𝒫 ×
(𝒫 → 𝒜_⊥) × Bool, and we let 𝒪 = OID → 𝒟_⊥. Given an operation descriptor, selectors for the
components are msg, cnd, sel, dlv, ack, and rsp.
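As an illustration, the operation descriptor and the restriction of a function to a subset of processes could be rendered in Python as below. The class and helper names are assumptions made for the example, functions over processes are modelled as dictionaries, and `None` stands for the undefined value ⊥.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Optional, Set


def restrict(f: Dict[str, Any], subset: Iterable[str]) -> Dict[str, Any]:
    """f|P: the function f restricted to the processes in subset."""
    return {p: f[p] for p in subset}


@dataclass
class OperationDescriptor:
    msg: Any                                   # message describing the request
    cnd: Callable[[Dict[str, Any]], Any]       # condenser function
    sel: str                                   # "read" or "write"
    dlv: Set[str] = field(default_factory=set)             # processes that got msg
    ack: Dict[str, Optional[Any]] = field(default_factory=dict)  # acknowledgments
    rsp: bool = False                          # has the client been answered?


def new_descriptor(msg, cnd, sel, processes: Iterable[str]) -> OperationDescriptor:
    """Default descriptor: nothing delivered, every acknowledgment undefined."""
    if sel not in ("read", "write"):
        raise ValueError("sel must be 'read' or 'write'")
    return OperationDescriptor(msg, cnd, sel, set(), {p: None for p in processes}, False)
```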

The code of the DC specification is given in Figure 6-1. 

Next we provide remarks and an informal description of this code. We start with the derived 
variables. 

A configuration c ∈ Att is said to be attempted. For an attempted configuration c there exists at
least one process p that has executed action newconf(c)_p and thus we have that p ∈ attempted[c.id];
when this holds we say that c is attempted at p or that p has attempted c. A configuration c ∈ TotAtt
is said to be totally attempted. A totally attempted configuration is a configuration that is attempted
at all members of the configuration.

A configuration c ∈ Est is said to be established. For an established configuration c there exists at
least one process p that has executed action newstate(s)_p and thus we have that p ∈ state-dlv[c.id];
when this holds we say that c is established at p or that p has established c. A configuration
c ∈ TotEst is said to be totally established. A totally established configuration is a configuration that
is established at all members of the configuration.

A dead configuration c is a configuration for which a member process p went on to newer
configurations, that is, it executed action newconf(c')_p with c'.id > c.id, before receiving the notification
for configuration c, that is, the newconf(c)_p event.
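The derived variables can be computed from the state as in the following Python sketch. The representation is an assumption of the example: configurations are identified by integer identifiers, `created` maps each identifier to its membership set, and, as a simplification, dead configurations are computed only among the created ones.

```python
from typing import Dict, FrozenSet, Optional, Set


def derived(created: Dict[int, FrozenSet[str]],
            attempted: Dict[int, Set[str]],
            state_dlv: Dict[int, Set[str]],
            cur_cid: Dict[str, Optional[int]]) -> Dict[str, Set[int]]:
    """Compute Att, TotAtt, Est, TotEst and dead as sets of identifiers."""
    def dead(g: int, members: FrozenSet[str]) -> bool:
        # Some member moved past g without ever attempting g.
        return any(cur_cid.get(p) is not None and cur_cid[p] > g
                   and p not in attempted.get(g, set())
                   for p in members)

    return {
        "Att": {g for g in created if attempted.get(g)},
        "TotAtt": {g for g, m in created.items() if m <= attempted.get(g, set())},
        "Est": {g for g in created if state_dlv.get(g)},
        "TotEst": {g for g, m in created.items() if m <= state_dlv.get(g, set())},
        "dead": {g for g, m in created.items() if dead(g, m)},
    }
```

In this sketch a configuration that was skipped by one of its members is reported as dead and can never appear in TotAtt afterwards, because newconf delivers configurations in increasing order of identifier.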

Now we comment on the transitions. 

Action createconf(c) creates a new configuration c. The first precondition simply requires this 
new configuration to have a brand new identifier. The second precondition of this action is the key 
to our specification. It states that when a configuration c is created it must either be already dead or 
for any other configuration w such that there are no intervening totally established configurations, 
the earlier configuration (i.e., the one with smaller identifier) has at least one read quorum and one 
write quorum that are subsets of the membership set of the later configuration (i.e., the one with 
bigger identifier). 
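The second precondition of createconf can be read as the following check, given here as a hedged Python sketch. The `Configuration` class, the function names, and the representation of dead and TotEst as sets of identifiers are assumptions of the example, not the notation of the specification.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class Configuration:
    """Hypothetical read-write-quorum configuration."""
    id: int
    members: FrozenSet[str]
    rqrms: Tuple[FrozenSet[str], ...]
    wqrms: Tuple[FrozenSet[str], ...]


def compatible(earlier: Configuration, later: Configuration,
               tot_est_ids: Set[int]) -> bool:
    """Either a totally established configuration lies strictly in between,
    or the later membership contains a read and a write quorum of the earlier."""
    if any(earlier.id < x < later.id for x in tot_est_ids):
        return True
    return (any(r <= later.members for r in earlier.rqrms)
            and any(w <= later.members for w in earlier.wqrms))


def createconf_enabled(c: Configuration, created: Iterable[Configuration],
                       dead_ids: Set[int], tot_est_ids: Set[int]) -> bool:
    created = list(created)
    if any(w.id == c.id for w in created):
        return False                      # identifier must be brand new
    if c.id in dead_ids:
        return True                       # dead configurations are unconstrained
    for w in created:
        if w.id in dead_ids:
            continue
        earlier, later = (w, c) if w.id < c.id else (c, w)
        if not compatible(earlier, later, tot_est_ids):
            return False
    return True
```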

Action newconf(c)_p delivers a created configuration c to the client process p. The precondition
of this action makes sure that configurations are delivered in order of configuration identifiers. We
notice that because of this precondition, when a configuration c is dead because a process q went on
to newer configurations, we have that process q can no longer execute action newconf(c)_q.

Once a configuration c has been delivered to a client process p, the client process p is supposed
to submit its current state s and a condenser function φ, by means of action submit-state(s, φ)_p. Once
all the processes have submitted their current states, the condenser function φ is used to compute
the starting state of configuration c for process p. The code of this action just memorizes the state
s and the condenser function φ for the current configuration of process p.

DC

Signature:

  Input:    submit(m, φ, b, i)_p, m ∈ ℳ_c, φ ∈ Φ, b ∈ {"read", "write"}, p ∈ 𝒫, i ∈ OID_p
            ackdlvr(a, i)_p, a ∈ 𝒜, i ∈ OID, p ∈ 𝒫
            submit-state(s, φ)_p, s ∈ 𝒮, φ ∈ Φ
  Internal: createconf(c), c ∈ 𝒞
  Output:   newconf(c)_p, c ∈ 𝒞, p ∈ c.set
            newstate(s)_p, s ∈ 𝒮
            respond(a, i)_p, a ∈ 𝒜, i ∈ OID_p, p ∈ 𝒫
            deliver(m, i)_p, m ∈ ℳ_c, i ∈ OID, p ∈ 𝒫

State:

  created ∈ 2^𝒞, init {c_0}
  for each p ∈ 𝒫:
    cur-cid[p] ∈ 𝒢_⊥, init g_0 if p ∈ P_0, ⊥ else
  for each g ∈ 𝒢:
    attempted[g] ∈ 2^𝒫, init P_0 if g = g_0, {} else
    got-state[g] ∈ 𝒫 → 𝒮_⊥, init everywhere ⊥
    condenser[g] ∈ Φ_⊥, init everywhere ⊥
    state-dlv[g] ∈ 2^𝒫, init P_0 if g = g_0, {} else
    pending[g] ∈ 𝒪, init everywhere ⊥

Derived variables:

  Att ∈ 2^𝒞, defined as {c ∈ created | attempted[c.id] ≠ ∅}
  TotAtt ∈ 2^𝒞, defined as {c ∈ created | c.set ⊆ attempted[c.id]}
  Est ∈ 2^𝒞, defined as {c ∈ created | state-dlv[c.id] ≠ ∅}
  TotEst ∈ 2^𝒞, defined as {c ∈ created | c.set ⊆ state-dlv[c.id]}
  dead ∈ 2^𝒞, defined as dead = {c ∈ 𝒞 | ∃p ∈ c.set : cur-cid[p] > c.id and p ∉ attempted[c.id]}

Actions:

  internal createconf(c)
    Pre: ∀w ∈ created : c.id ≠ w.id
         if c ∉ dead then
           ∀w ∈ created, w.id < c.id:
             w ∈ dead or
             (∃x ∈ TotEst: w.id < x.id < c.id) or
             (∃R ∈ w.rqrms, ∃W ∈ w.wqrms: R ∪ W ⊆ c.set)
           ∀w ∈ created, w.id > c.id:
             w ∈ dead or
             (∃x ∈ TotEst: c.id < x.id < w.id) or
             (∃R ∈ c.rqrms, ∃W ∈ c.wqrms: R ∪ W ⊆ w.set)
    Eff: created := created ∪ {c}

  output newconf(c)_p, p ∈ c.set
    Pre: c ∈ created
         c.id > cur-cid[p]
    Eff: cur-cid[p] := c.id
         attempted[c.id] := attempted[c.id] ∪ {p}

  input submit-state(s, φ)_p
    Eff: if cur-cid[p] ≠ ⊥ and got-state[cur-cid[p]](p) = ⊥ then
           got-state[cur-cid[p]](p) := s
           condenser[cur-cid[p]](p) := φ

  output newstate(s)_p choose c
    Pre: c.id = cur-cid[p]
         c ∈ created
         ∀q ∈ c.set, got-state[c.id](q) ≠ ⊥
         let f = condenser[c.id](p)|c.set
         s = f(got-state[c.id])
         p ∉ state-dlv[c.id]
    Eff: state-dlv[c.id] := state-dlv[c.id] ∪ {p}

  input submit(m, φ, b, i)_p
    Eff: if cur-cid[p] ≠ ⊥ then
           pending[cur-cid[p]](i) := (m, φ, b, ∅, λ(x) : x → ⊥, false)

  output deliver(m, i)_p choose g
    Pre: g = cur-cid[p]
         p ∉ pending[g](i).dlv
         pending[g](i).msg = m
    Eff: pending[g](i).dlv := pending[g](i).dlv ∪ {p}

  input ackdlvr(a, i)_p
    Eff: if cur-cid[p] ≠ ⊥ and pending[cur-cid[p]](i).ack(p) ≠ ⊥ then
           pending[cur-cid[p]](i).ack(p) := a

  output respond(r, i)_p choose c, Q
    Pre: c.id = cur-cid[p]
         c ∈ created
         i ∈ OID_p
         pending[c.id](i).rsp = false
         if pending[c.id](i).sel = "read"
           then Q ∈ c.rqrms
           else Q ∈ c.wqrms
         let f = pending[c.id](i).ack
         r = pending[c.id](i).cnd(f|Q)
    Eff: pending[c.id](i).rsp := true

Figure 6-1: The DC specification

Action newstate(s)_p computes the starting state for a configuration c. The precondition of this
action requires that all processes q in the membership of configuration c have submitted their state
for configuration c. The starting state s of configuration c for process p is then computed by applying
the condenser function that process p has submitted to the service with action submit-state(s, φ)_p.
Variable state-dlv[c.id] records the fact that p has received the starting state for configuration c.

We remark that for a dead configuration c there is at least one process p that does not execute
action newconf(c)_p and thus does not submit its state for c with action submit-state(s, φ)_p. This implies
that action newstate(s)_q cannot be executed for any process q. This is why such configurations are
called "dead".

The remaining actions are used to handle the requests of clients. We refer to the process of
handling such a request, which involves the participation of a quorum of processes, as an "operation".
To request the execution of an operation a client process p uses action submit(m, φ, b, i)_p. The
parameters of this action are as follows: m is a message describing the operation that p needs to perform
(e.g., read a register, write a register); φ is a condenser function to be used to compute a response
value for p when a quorum of processes have provided acknowledgment values to p's message m; b is
just a selector indicating whether to wait for acknowledgment values from a write quorum or from a
read quorum; i is an operation identifier needed to distinguish operations (every requested operation
has a unique operation identifier). We say "operation i" to indicate the operation requested with
action submit(m, φ, b, i)_p. For configuration c and operation i, the variable pending[c.id](i) contains
an operation descriptor; the code of action submit(m, φ, b, i)_p sets this operation descriptor to a
default value.

We now provide an explanation for each component of an operation descriptor. Let d be an
operation descriptor for operation i requested by p in configuration c. d.msg is the message that
describes the request of p; such a message will be delivered to all members of the configuration
c. d.cnd is the condenser function that will be used to compute the response for the operation
once a quorum of processes has provided acknowledgment values. d.sel is the selector that specifies
whether to use a read or a write quorum. d.dlv is the set of processes to which the message d.msg
has been delivered; initially this is set to an empty set by action submit(m, φ, b, i)_p. d.ack contains
the acknowledgment values received; initially this is a vector of ⊥ values. Finally, d.rsp is a flag
indicating whether or not the client p, which requested the operation, has received a response for
the operation.

Action deliver(m, i)_p delivers the message m of operation i to process p. The code of this action
updates the operation descriptor d for operation i by adding process p to the set d.dlv.

Processes that receive the message m for an operation i are supposed to provide an acknowledgment
value a with action ackdlvr(a, i)_p. The code of this action records the acknowledgment value a
of process p into the vector d.ack, where d is the operation descriptor for operation i.

Finally, action respond(r, i)_p provides a response r to process p for the operation i previously
submitted by p. The precondition of this action requires that a quorum Q has provided acknowledgment
values (the type of the quorum depends on the selector provided at the time of the operation
submission). Then the value r is computed by applying the condenser function provided by p at the
time of the submission to the acknowledgment values of processes in Q. At this point the operation
has been serviced and the rsp component is set to true.

6.3 Invariants of DC 

In this section we provide invariants of DC. These invariants are used to prove the correctness of the 
application that we build on top of DC. 

Invariant 6.3.1 In any reachable state of DC, the following is true. Let c_1, c_2 ∈ created \ dead, with
c_1.id < c_2.id. Then either there exists w ∈ TotEst, c_1.id < w.id < c_2.id, or else there exist R, W, quorums
of c_1, such that R ∪ W ⊆ c_2.set.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
are no two configurations c_1, c_2 ∈ created such that c_1.id < c_2.id.

For the inductive step assume that the invariant is true in a state s. We need to prove that
the invariant is true in s' for any step (s, π, s'). The only action that we need to worry about is
createconf(c), where c = c_1 or c = c_2, because it creates a new configuration (otherwise the invariant
is true in s' by the inductive hypothesis). So assume that π = createconf(c). The invariant follows
easily from the precondition of π. □

We remark that the intersection property stated in the above invariant may not hold for dead
configurations. However, in a dead configuration it is not possible to make progress because for such
a configuration there is at least one process that will not participate and thus the configuration will
never become established.

The need for considering dead configurations comes from the implementation of the specification 
that we provide. It is possible to give a stronger version of DC by requiring that the intersection 
property in the precondition of action createconf holds also for dead configurations. We do not 
know if this stronger version is implementable. 

Invariant 6.3.2 In any reachable state of DC, the following is true. If p ∈ attempted[g] then
cur-cid[p] ≥ g.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state we have that attempted[g] is P_0 if g = g_0 and
∅ otherwise. Moreover for each p ∈ P_0 we have that cur-cid[p] = g_0. Hence the invariant is true.
For the inductive step fix g and p and assume that the invariant is true in a state s. We need to
prove that the invariant is true in s' for any step (s, π, s'). The actions that can make the invariant
false are those that either put p into attempted[g] or modify cur-cid[p]. Hence we need only to
worry about action newconf(c). So assume that π = newconf(c). The invariant follows easily from the
precondition of π. □

Invariant 6.3.3 In any reachable state of DC, the following is true. If c ∈ created \ dead, w ∈ TotAtt,
and w.id > c.id, then there exists R ∈ c.rqrms such that for all p ∈ R, cur-cid[p] > c.id.

Proof: Consider any particular reachable state. Assume that c ∈ created \ dead, w ∈ TotAtt, and
w.id > c.id. Then let y be the configuration in TotAtt having the smallest identifier strictly greater
than c.id. Note that y ∉ dead, since a dead configuration cannot be totally attempted. Then there
is no x ∈ TotEst with c.id < x.id < y.id. Then Invariant 6.3.1 implies that for some R ∈ c.rqrms,
R ⊆ y.set. Let p be any element of R. Since y ∈ TotAtt we have that p ∈ attempted[y.id]. By
Invariant 6.3.2 we have cur-cid[p] ≥ y.id. Since y.id > c.id we have cur-cid[p] > c.id. □

6.4 An implementation of DC 

In this section we provide an algorithm that implements, in the sense of trace inclusion, the DC 
specification. The algorithm is built on top of the cs service and uses ideas from [88]. 

6.4.1 Overview 

The implementation of DC that we provide in this section is similar to the implementation of DVS
provided in Chapter 5. However, there is a key difference in the implementation of DC compared to
that of DVS. This difference provides new insights for the DVS specification and implementation, as
we explain below.

The DVS specification requires a global intersection property which is the following: given two
primary views w and v with no intervening totally established view¹, we must have that w.set ∩
v.set ≠ ∅. The DVS implementation, when delivering a new view v, checks a stronger property locally
to the processes, which requires that |v.set ∩ w.set| > |w.set|/2 for all the views w, w.id < v.id,
known by the process performing the check.

¹ "Established" views are called "registered" views in Chapter 5. This is due to the fact that the DVS specification
requires client processes to "register" a new view when they have obtained enough information to begin regular
computation; in the DC service this is handled by the service itself. However the meaning of "established" is the same
as that of "registered", that is, a client process has got the information needed to proceed with regular computation.
We use a different name just to emphasize the fact that in DVS clients need to "register" views while in DC configurations
become "established" under the control of the service.

The DC specification requires the following global intersection property: given two primary configurations, neither of which is dead, with no intervening totally established configuration, there must exist a read quorum and a write quorum of the configuration with the smaller identifier that are both included in the membership set of the configuration with the bigger identifier. The DC implementation checks the same property locally at each process. Intuitively, checking the property locally suffices to prove it globally because we exclude dead configurations. This suggests that, if we exclude dead views, for DVS too we can either prove the stronger intersection property (the one checked locally) or use a weaker local check (the intersection required globally).
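The two local checks contrasted above can be sketched as follows. This is only an illustration: the `Config` record and the function names `dvs_local_check` and `dc_check` are assumptions introduced here, not part of the automata in this chapter.

```python
from dataclasses import dataclass
from typing import FrozenSet

Quorum = FrozenSet[str]

@dataclass(frozen=True)
class Config:
    """Hypothetical record for a configuration: identifier, membership, quorums."""
    id: int
    members: FrozenSet[str]      # plays the role of c.set
    rqrms: FrozenSet[Quorum]     # read quorums
    wqrms: FrozenSet[Quorum]     # write quorums

def dvs_local_check(w_set, v_set):
    """DVS-style local check: v contains a strict majority of the members of w."""
    return 2 * len(v_set & w_set) > len(w_set)

def dc_check(w, c):
    """DC-style check: c.set includes some read quorum and some write quorum of w."""
    has_read = any(r <= c.members for r in w.rqrms)
    has_write = any(q <= c.members for q in w.wqrms)
    return has_read and has_write
```

For instance, a configuration whose membership contains a read quorum of `w` but no write quorum of `w` fails `dc_check`, even if it would pass the majority test.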

The DC specification is built upon a static configuration service, called CS. This service is basically
the VS service adapted to handle configurations (see Section 4.2).

The automaton CS-TO-DC$_p$ is given in Figures 6-2 and 6-3. The overall system DC-IMPL is defined
as the composition of CS and CS-TO-DC$_p$, for $p \in \mathcal{P}$.

Automaton CS-TO-DC$_p$ uses special messages, tagged either with "info", used to send information about the active and ambiguous configurations, or with "got-state", used to send the state submitted by a process to all the members of the configuration. The former information is needed to check the intersection property that new primary configurations have to satisfy according to the DC specification. The latter information is needed in order to compute the starting state for a new configuration. Thus, we use $\mathcal{M} = \mathcal{M}_c \cup (\{\text{"info"}\} \times \mathcal{C} \times 2^{\mathcal{C}}) \cup \{\text{"got-state"}\}$, where $\mathcal{M}_c$ is the set of all client messages and $\mathcal{M}$ is the universe of all messages.

The major problem is that the DC specification requires a global intersection property (i.e., a
property that can be checked only by someone who knows the entire system state), while each single
process has only local knowledge of the system. So, in order to guarantee that a new configuration
satisfies the requirement of DC, each single process needs information from the other processes that
are members of the configuration.

Informally, the filtering of configurations works as follows. Each process keeps track of the latest totally established configuration, called the "active" configuration and recorded in variable $act$, and of a set of "ambiguous" configurations, recorded in variable $amb$, which are those configurations that were notified after the active configuration but have not yet become established. We define $use = \{act\} \cup amb$. When CS provides a new configuration to process $p$ by means of action cs-newconf$(c)_p$, process $p$ sends out an "info" message containing its current $act_p$ and $amb_p$ values to all other processes in the new configuration, using the CS service, and waits to receive the corresponding "info" messages for configuration $c$ from all the other members of $c$. After receiving this information (and updating its own $act_p$ and $amb_p$ accordingly), process $p$ checks whether $c$ has the required intersection property with each configuration in the $use_p$ set. If so, configuration $c$ is given in output to the client at $p$ by means of action newconf$(c)_p$.
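The filtering step just described can be sketched in a few lines. The sketch assumes a hypothetical `Config` tuple and two helper names, `merge_info` and `may_attempt`, chosen here for illustration; it mirrors the update of $act$ and $amb$ on receipt of an "info" message and the quorum-inclusion test against every configuration in $use$.

```python
from typing import FrozenSet, NamedTuple

class Config(NamedTuple):
    """Hypothetical configuration record (not the thesis's formal type)."""
    id: int
    members: FrozenSet[str]
    rqrms: FrozenSet[FrozenSet[str]]
    wqrms: FrozenSet[FrozenSet[str]]

def merge_info(act, amb, x, X):
    """Fold a received ("info", x, X) into the local act/amb values."""
    if x.id > act.id:
        act = x                      # adopt the more recent active configuration
    # keep only ambiguous configurations newer than the active one
    amb = {w for w in amb | X if w.id > act.id}
    return act, amb

def may_attempt(c, act, amb):
    """True if c.set includes a read and a write quorum of every w in use."""
    use = {act} | amb
    return all(
        any(r <= c.members for r in w.rqrms)
        and any(q <= c.members for q in w.wqrms)
        for w in use
    )
```

In this sketch a process would call `merge_info` once per "info" message received for the new configuration and call `may_attempt` only after hearing from all other members.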

When a new primary configuration $c$ has been given in output to process $p$ by means of action newconf$(c)_p$, the client at $p$ submits its current state together with a condenser function to be used to compute the starting state when all other members have submitted their state (such a condenser function depends on the application). Clearly the state of $p$ is needed by other processes in the configuration, while $p$ needs the state of the other processes. Hence when a submit-state$(s, \phi)_p$ is executed at $p$, the state $s$ submitted by process $p$ is sent out with a got-state message to all other members of the configuration, using the CS service. Upon receiving the state of all other processes, CS-TO-DC$_p$ uses the condenser function $\phi$ provided by the client at $p$ in order to compute the starting state to be output, by means of action newstate$(s)_p$, to the client at $p$.
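As a small illustration of the state exchange, the following sketch computes a starting state only once the states of all members are known. The names `try_newstate` and `latest_version`, and the choice of a "highest version wins" condenser, are assumptions made for this example; the actual condenser function is application dependent.

```python
def try_newstate(members, state_got, condenser):
    """Return the starting state if every member's state is known, else None.

    members   -- the membership set of the current configuration
    state_got -- dict mapping a process to the state it submitted
    condenser -- function from {process: state} to a single starting state
    """
    if any(q not in state_got for q in members):
        return None                      # still waiting for some got-state message
    return condenser({q: state_got[q] for q in members})

def latest_version(states):
    """Hypothetical condenser: states are (version, value) pairs; keep the newest."""
    return max(states.values())
```

With this condenser every member that has collected the same set of states computes the same starting state, since `max` does not depend on the order in which states arrived.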

The communication mechanism of DC is quite different from that offered by CS: DC offers a broadcast/convergecast communication mechanism, while CS offers point-to-point communication channels. However, it is not difficult to implement the former by using the communication service of the latter. The relevant code is in Figure 6-3. When a message $m$ is submitted by means of action submit$(m, \phi, b, i)_p$, together with a condenser function $\phi$, a quorum type $b \in \{\text{"read"}, \text{"write"}\}$, and an operation identifier $i$, message $m$, tagged with $i$, is sent to all the members of the configuration using the CS service, and an operation descriptor for $i$ is initialized. When another process $q$ receives the message $m$ of operation $i$, it delivers it to its client by means of action deliver$(m, i)_q$. At this point the client at $q$ is expected to supply an acknowledgment value $a$ for operation $i$ by means of action ackdlvr$(a, i)_q$. This value is sent back to $p$ using the CS service. When $p$ receives this value, it updates the descriptor of operation $i$ with the value obtained from $q$. If there are acknowledgment values from a quorum of the type specified by $b$, then the condenser function $\phi$ is applied to the acknowledgment values of this quorum in order to compute a response to the message $m$ submitted by $p$. Such a response is given to $p$ by means of action respond$(r, i)_p$.
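The convergecast half of this mechanism can be sketched as follows. The function name `try_respond` and the use of `max` as a condenser are illustrative assumptions; the sketch only shows the rule "respond once acknowledgments from some quorum of the requested type are available".

```python
def try_respond(acks, quorums, condenser):
    """Return a response if some quorum has fully acknowledged, else None.

    acks      -- dict mapping a process to its acknowledgment value
                 (only processes that have answered appear as keys)
    quorums   -- iterable of quorums (sets of processes) of the requested type
    condenser -- function from {process: ack value} to a response
    """
    for quorum in quorums:
        if all(q in acks for q in quorum):
            # restrict the acknowledgments to the chosen quorum before condensing
            return condenser({q: acks[q] for q in quorum})
    return None
```

Note that this sketch picks the first complete quorum in iteration order, whereas the automaton leaves the choice of quorum nondeterministic; it also uses `None` to mean "no quorum yet", so it assumes responses are never `None`.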

There are five derived variables for DC-IMPL. Four of them are analogous to those of DVS,
indicating the attempted, totally attempted, established, and totally established configurations,
respectively. A fifth one, $use_p$, keeps track of the set of configurations used to check the
intersection property before attempting a new configuration.

6.4.2 Invariants of DC-IMPL

This section contains invariants of DC-IMPL needed for the proof that DC-IMPL implements DC in
Section 6.4.3. The proofs of these invariants are similar to those of the corresponding invariants of
the DVS implementation (see Section 5.2.2). This is because the implementation of DC is similar to
the implementation of DVS, so many basic invariants are the same. For each of these basic invariants
we provide an operational proof (i.e., a proof that does not rely exclusively on the state previous to
the one for which the invariant states a property).
CS-TO-DC$_p$

Signature:

Input:
    cs-newconf$(c)_p$, $c \in \mathcal{C}$, $p \in c.set$
    cs-gprcv$(m)_{q,p}$, $m \in \mathcal{M}$, $q \in \mathcal{P}$
    cs-safe$(m)_{q,p}$, $m \in \mathcal{M}$, $q \in \mathcal{P}$
    submit-state$(s, \phi)_p$, $s \in \mathcal{S}$, $\phi \in \Phi$
    submit$(m, \phi, b, i)_p$, $m \in \mathcal{M}_c$, $\phi \in \Phi$, $b \in \{\text{"read"}, \text{"write"}\}$, $p \in \mathcal{P}$, $i \in OID_p$
    ackdlvr$(a, i)_p$, $a \in \mathcal{A}$, $i \in OID$, $p \in \mathcal{P}$
Internal:
    garbage-collect$(c)_p$, $c \in \mathcal{C}$
Output:
    cs-gpsnd$(m)_p$, $m \in \mathcal{M}$
    dc-newconf$(c)_p$, $c \in \mathcal{C}$, $p \in c.set$
    dc-newstate$(s)_p$, $s \in \mathcal{S}$
    deliver$(m, i)_p$, $m \in \mathcal{M}_c$, $i \in OID$, $p \in \mathcal{P}$
    respond$(a, i)_p$, $a \in \mathcal{A}$, $i \in OID_p$, $p \in \mathcal{P}$

State:

    $cur \in \mathcal{C}_\bot$, init $c_0$ if $p \in P_0$, $\bot$ else
    $\text{client-cur} \in \mathcal{C}_\bot$, init $c_0$ if $p \in P_0$, $\bot$ else
    $act \in \mathcal{C}$, init $c_0$
    $amb \in 2^{\mathcal{C}}$, init $\emptyset$
    $attempted \in 2^{\mathcal{C}}$, init $\{c_0\}$ if $p \in P_0$, $\emptyset$ else
    for each $g \in \mathcal{G}$, $q \in \mathcal{P}$:
        $\text{info-rcvd}[q, g] \in (\mathcal{C} \times 2^{\mathcal{C}})_\bot$, init $\bot$
        $\text{rcvd-estb}[q, g] \in (\mathcal{C} \times 2^{\mathcal{C}})_\bot$, init $\bot$
    for each $g \in \mathcal{G}$:
        $\text{to-cs}[g] \in seqof(\mathcal{M})$, init $\lambda$
        $\text{info-sent}[g] \in (\mathcal{C} \times 2^{\mathcal{C}})_\bot$, init $\bot$
        $\text{dlv-queue}[g] \in seqof(\mathcal{M})$, init $\lambda$
        $cond[g] \in \Phi_\bot$, init $\bot$
        $pend[g] \in \mathcal{O}$, init $\bot$
        $\text{msg-dlvd}[g] = OID \to \{true, false\}$
        $\text{state-got}[g] = \mathcal{P} \to \mathcal{S}_\bot$, init $\bot$
        $estb[g] \in Bool$, init true if $p \in P_0$ and $g = g_0$, false else

Derived variables:

    $Att \in 2^{\mathcal{C}}$, defined as $Att = \{c \in created \mid (\exists p \in c.set)\ c \in attempted_p\}$;
    $Est \in 2^{\mathcal{C}}$, defined as $Est = \{c \in created \mid (\exists p \in c.set)\ estb[c.id]_p = true\}$;
    $TotAtt \in 2^{\mathcal{C}}$, defined as $TotAtt = \{c \in created \mid (\forall p \in c.set)\ c \in attempted_p\}$;
    $TotEst \in 2^{\mathcal{C}}$, defined as $TotEst = \{c \in created \mid (\forall p \in c.set)\ estb[c.id]_p = true\}$;
    $use \in 2^{\mathcal{C}}$, defined as $use = \{act\} \cup amb$.

Transitions:

input cs-newconf$(c)_p$
    Eff: $cur := c$
         append $\langle\text{"info"}, act, amb\rangle$ to $\text{to-cs}[cur.id]$
         $\text{info-sent}[cur.id] := (act, amb)$

input cs-gprcv$(\langle\text{"info"}, c, C\rangle)_{q,p}$
    Eff: $\text{info-rcvd}[q, cur.id] := (c, C)$
         if $c.id > act.id$ then $act := c$
         $amb := \{w \in amb \cup C \mid w.id > act.id\}$

input cs-safe$(\langle\text{"info"}, c, C\rangle)_{q,p}$
    Eff: none

output newconf$(c)_p$
    Pre: $c = cur$
         $c.id > \text{client-cur}.id$
         $\forall q \in c.set,\ q \neq p$: $\text{info-rcvd}[q, c.id] \neq \bot$
         $\forall w \in use$: $\exists R \in w.rqrms,\ R \subseteq c.set$
         $\forall w \in use$: $\exists W \in w.wqrms,\ W \subseteq c.set$
    Eff: $amb := amb \cup \{c\}$
         $attempted := attempted \cup \{c\}$
         $\text{client-cur} := c$

input submit-state$(s, \phi)_p$
    Eff: $g = \text{client-cur}.id$
         if $g \neq \bot$ then
             $\text{state-got}[g](p) := s$
             $cond[g] := \phi$
             append $\langle\text{"state-got"}, s\rangle$ to $\text{to-cs}[g]$

input cs-gprcv$(\langle\text{"state-got"}, s\rangle)_{q,p}$
    Eff: $\text{state-got}[cur.id](q) := s$

input cs-safe$(\langle\text{"state-got"}, s\rangle)_{q,p}$
    Eff: none

output newstate$(s)_p$
    Pre: $g = cur.id$
         $\forall q \in c.set$, $\text{state-got}[g](q) \neq \bot$
         $s = cond[g](\text{state-got}[g]|cur.set)$
         $estb[g] = false$
    Eff: $estb[g] := true$
         append "established" to $\text{to-cs}[g]$

input cs-gprcv$(\text{"established"})_{q,p}$
    Eff: $\text{rcvd-estb}[q, cur.id] := true$

input cs-safe$(\text{"established"})_{q,p}$
    Eff: none

internal garbage-collect$(c)_p$
    Pre: $\forall q \in c.set$, $\text{rcvd-estb}[q, c.id] = true$
         $c.id > act.id$
    Eff: $act := cur$
         $amb := \{w \in amb \mid w.id > act.id\}$

output cs-gpsnd$(m)_p$
    Pre: $m$ is head of $\text{to-cs}[cur.id]$
    Eff: remove head of $\text{to-cs}[cur.id]$

Figure 6-2: CS-TO-DC$_p$
CS-TO-DC$_p$ (transitions cont'd)

input submit$(m, \phi, b, i)_p$
    Eff: $g = \text{client-cur}.id$
         if $g \neq \bot$ then
             $pend[g](i) := (m, \phi, b, \emptyset, \lambda(x) : x \to \bot, false)$
             append $\langle m, i\rangle$ to $\text{to-cs}[g]$

input cs-gprcv$(\langle m, i\rangle)_{q,p}$
    Eff: append $\langle m, i\rangle$ to $\text{dlv-queue}[cur.id]$

input cs-safe$(\langle m, i\rangle)_{q,p}$
    Eff: none

output deliver$(m, i)_p$
    Pre: $\langle m, i\rangle = head(\text{dlv-queue}[cur.id])$
         $\text{msg-dlvd}[cur.id](i) = false$
    Eff: $\text{dlv-queue}[cur.id] := tail(\text{dlv-queue}[cur.id])$
         $\text{msg-dlvd}[cur.id](i) := true$

input ackdlvr$(a, i)_p$
    Eff: append $\langle a, i\rangle$ to $\text{to-cs}[\text{client-cur}.id]$

input cs-gprcv$(\langle a, i\rangle)_{q,p}$
    Eff: if $i \in OID_p$ then
             $pend[i].ack(q) := a$

input cs-safe$(\langle a, i\rangle)_{q,p}$
    Eff: none

output respond$(r, i)_p$, choose $Q$
    Pre: $g = cur.id$
         $i \in OID_p$
         $pend[g](i).rsp = false$
         if $pend[g].sel = \text{"read"}$
             then $Q \in cur.rqrms$
             else $Q \in cur.wqrms$
         let $f = pend[g](i).ack$
         $\forall q \in Q : f(q) \neq \bot$
         $r = pend[g](i).cnd(f|Q)$
    Eff: $pend[g](i).rsp := true$

Figure 6-3: CS-TO-DC$_p$ (transitions cont'd)

Invariant 6.4.1 (DC-IMPL)
In any reachable state, if $\langle\text{"info"}, x, X\rangle \in \text{to-cs}[g]_p$ or $\langle\text{"info"}, x, X\rangle \in pending[p, g]$ or $(\langle\text{"info"}, x, X\rangle, p) \in queue[g]$ or $\text{info-rcvd}[p, g]_q = (x, X)$, then $\text{info-sent}[g]_p = (x, X)$ and $cur.id_p \geq g$.

Proof Sketch: This invariant is true because whenever process $p$ puts the message $\langle\text{"info"}, x, X\rangle$ into $\text{to-cs}[g]_p$ in action cs-newconf$(c)$, where $c.id = g$, it sets $\text{info-sent}[g]_p := (x, X)$. Moreover, at that moment it also sets $cur := c$. From that moment on, because configuration identifiers provided by CS only increase, we have that $cur.id_p \geq g$. Clearly this continues to be true when the "info" message goes through $pending[p, g]$ and $queue[g]$, and finally gets to $q$ and is recorded into $\text{info-rcvd}[p, g]_q$. □

Invariant 6.4.2 (DC-IMPL)
In any reachable state, if $\text{info-sent}[g]_p = (x, X)$, $w \in attempted_p$, and $w.id < g$, then either $w \in \{x\} \cup X$ or $w.id < x.id$.

Proof Sketch: If process $p$ sent an "info" message for configuration $g$ and has also attempted a previous configuration $w$, then either process $p$ has already garbage-collected configuration $w$ (if a configuration with identifier bigger than $w.id$ has been totally established), which implies that $w.id < x.id$, or $w$ is still in the $use$ set of $p$, which implies that $w \in \{x\} \cup X$. □

Invariant 6.4.3 (DC-IMPL)
In any reachable state:

1. $act_p \in TotEst$.

2. If $\text{info-sent}[g]_p = (x, X)$ then $x \in TotEst$.

3. $use_p \cap TotEst \neq \emptyset$.

Proof Sketch: Variable $act_p$ is initially a totally established configuration and is always updated to a totally established configuration (see action garbage-collect). Hence Part 1 follows. Part 2 follows from the fact that if $\text{info-sent}[g]_p = (x, X)$ then the value of $x$ is the value of the variable $act_p$ at the time when $\text{info-sent}[g]_p$ is written, and thus by Part 1 is totally established. Part 3 follows from Part 1 and the definition of $use_p$. □

Invariant 6.4.4 (DC-IMPL)
In any reachable state:

1. If $cur_p \neq \bot$ and $w \in use_p$, then $w.id \leq cur.id_p$.

2. If $cur_p \neq \bot$ and $\text{client-cur}_p \neq cur_p$ and $w \in use_p$, then $w.id < cur.id_p$.

3. If $\text{info-sent}[g]_p = (x, X)$ and $w \in \{x\} \cup X$ then $w.id < g$.

Proof Sketch: The only action that adds a new configuration $c$ to $use_p$ is action newconf$(c)_p$. The precondition of this action requires that $cur_p = c$, which implies $cur.id_p = c.id$. The conclusion follows from the property of CS that configuration identifiers are released in increasing order. This proves Part 1.

Part 2 follows by observing that when $cur_p \neq \bot$ and $\text{client-cur}_p \neq cur_p$, a new configuration has been provided by CS but it has not been attempted yet. This implies that the current configuration $cur_p$ cannot be in the $use_p$ set. Combining this with the conclusion of Part 1, we have that for any $w \in use_p$, $w.id < cur.id_p$.

Finally, Part 3 follows from the fact that process $p$ sends information for a new configuration which has not been attempted yet. Once again, this implies that the current configuration $cur_p$ cannot be in the $use_p$ set, and we conclude as for Part 2. □

Invariant 6.4.5 (DC-IMPL)
In any reachable state, if $\text{info-rcvd}[q, g]_p = (x, X)$ and $w \in \{x\} \cup X$, then either $w \in use_p$ or $w.id < act.id_p$.

Proof Sketch: Since $\text{info-rcvd}[q, g]_p = (x, X)$, process $p$ has received $(x, X)$ from process $q$. When receiving this information (action cs-gprcv$(\langle\text{"info"}, x, X\rangle)$), process $p$ updates its $use_p$ set and $act.id_p$. The conclusion follows from the code of cs-gprcv$(\langle\text{"info"}, x, X\rangle)$. □
Invariant 6.4.6 (DC-IMPL)
In any reachable state, if $c \in attempted_p$ and $q \in c.set$ then $cur.id_q \geq c.id$.

Proof Sketch: Since $c \in attempted_p$, there has been a step newconf$(c)_p$. By the precondition of this action, since $q \in c.set$, process $p$ received information from $q$ for configuration $c$, that is, $\text{info-sent}[c.id]_q = (x, X)$. By Invariant 6.4.1 we have that $cur.id_q \geq c.id$. □

The following invariant states a simple fact about DC-IMPL. This invariant does not have a
corresponding one in DVS. However, since its statement is simple, we provide an operational proof
for this invariant as well.

Invariant 6.4.7 (DC-IMPL)
In any reachable state, if $\langle m, i\rangle \in \text{dlv-queue}[g]_q$ then $pend[g](i).msg_p = m$, where $p$ is such that $i \in OID_p$.

Proof Sketch: This is true because if $\langle m, i\rangle \in \text{dlv-queue}[g]_q$, then process $q$ received the message $\langle m, i\rangle$ from a process $p$, with $i \in OID_p$. Moreover, process $p$ sent the message $\langle m, i\rangle$ in action submit$(m, \phi, b, i)_p$, and this action sets $pend[g](i).msg_p = m$. □

Next we provide the main invariants that are needed in the proof of the simulation relation. 

Invariant 6.4.8 states that if a configuration c has been attempted by a process p and its mem- 
bership contains a process q which has attempted a configuration w previous to c and there is no 
totally established configuration between w and c then c contains a read and a write quorum of 
w. Intuitively this is true because when p attempts c it must have received information from all 
the members of c, thus also from q; since q attempted w and w has not been garbage collected, 
because there are no totally established configurations in between w and c, process q includes w 
in the information it sends to p for configuration c. Invariant 6.4.9 generalizes Invariant 6.4.8 by 
claiming that the intersection property holds for any configurations w and c such that there is no 
totally established configuration x with w.id < x.id < c.id. 

Finally Invariant 6.4.10 provides an intersection property crucial to the simulation relation. This 
last invariant is the one where dead configurations are needed. 

Invariant 6.4.8 (DC-IMPL)
In any reachable state, suppose that $c \in attempted_p$, $q \in c.set$, $w \in attempted_q$, $w.id < c.id$, and there is no $x \in TotEst$ such that $w.id < x.id < c.id$. Then there exist $R \in w.rqrms$ and $W \in w.wqrms$ such that $R \cup W \subseteq c.set$.

Proof: By induction on the length of an execution. The base case consists of proving that the invariant is true in the initial state. In the initial state, only $c_0$ is attempted, so the hypotheses cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state $s$. We need to prove that it is true in $s'$ for any possible step $(s, \pi, s')$. Fix $c$, $w$, $p$, and $q$, and assume that $c \in s'.attempted_p$, $q \in c.set$, $w \in s'.attempted_q$, $w.id < c.id$, and there is no $x \in s'.TotEst$ such that $w.id < x.id < c.id$. Then also there is no $x \in s.TotEst$ such that $w.id < x.id < c.id$. We consider four cases:

1. $c \in s.attempted_p$ and $w \in s.attempted_q$.

Then by the inductive hypothesis, in state $s$ there exist $R \in w.rqrms$ and $W \in w.wqrms$ such that $R \cup W \subseteq c.set$. Clearly this remains true in $s'$ too (it remains true forever).

2. $c \notin s.attempted_p$ and $w \notin s.attempted_q$.

This cannot happen because we cannot have both $c$ and $w$ becoming attempted in a single step.

3. $c \notin s.attempted_p$ and $w \in s.attempted_q$.

Then $\pi$ must be newconf$(c)_p$. Since $q \in c.set$, by the precondition of $\pi$ we have that $s.\text{info-rcvd}[q, c.id]_p = (x, X)$ for some $x$ and $X$. Then Invariant 6.4.1 implies that $s.\text{info-sent}[c.id]_q = (x, X)$. Then, since $w.id < c.id$, Invariant 6.4.2 implies that either $w \in \{x\} \cup X$ or $w.id < x.id$. We claim that it must be $w \in \{x\} \cup X$. Indeed, if $w.id < x.id$, by Invariant 6.4.3 we have that $x \in s.TotEst$ and by Invariant 6.4.4, Part 3 (used with $w = x$) we have $x.id < c.id$; thus we would have a totally established configuration $x$ such that $w.id < x.id < c.id$. This contradicts the inductive hypothesis. So it must be $w \in \{x\} \cup X$.

By Invariant 6.4.5 we have that either $w \in s.use_p$ or $w.id < s.act.id_p$. In the former case, by the precondition of $\pi$, we have the needed conclusion. In the latter case, we obtain a contradiction. Indeed, by Invariant 6.4.3 we have $s.act_p \in TotEst$. Moreover, by the precondition of $\pi$, $s.cur_p$ cannot be $\bot$ and $s.cur_p > s.\text{client-cur}_p$ and, by definition, $s.act_p \in s.use_p$. Hence by Invariant 6.4.4, Part 2, we have $s.act.id_p < s.cur.id_p = c.id$. Thus we would have a totally established configuration $act$ such that $w.id < act.id < c.id$. This contradicts the inductive hypothesis.

4. $c \in s.attempted_p$ and $w \notin s.attempted_q$.

Then $\pi$ must be newconf$(w)_q$. But this cannot happen. Indeed, since $c \in s.attempted_p$ and $q \in c.set$, Invariant 6.4.6 implies that $s.cur.id_q \geq c.id$. Since $c.id > w.id$, we have $s.cur.id_q > w.id$. But the precondition of action $\pi$ requires $s.cur.id_q = w.id$, so $\pi$ is not enabled in $s$.

□

Invariant 6.4.9 (DC-IMPL)
In any reachable state, suppose that $c \in Att$, $w \in TotEst$, $w.id < c.id$, and there is no $x \in TotEst$ such that $w.id < x.id < c.id$. Then there exist $R \in w.rqrms$ and $W \in w.wqrms$ such that $R \cup W \subseteq c.set$.

Proof: By induction on the length of an execution. The base case consists of proving that the invariant is true in the initial state. In the initial state, only $c_0$ is attempted, so the hypotheses cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state $s$. We need to prove that it is true in $s'$ for any possible step $(s, \pi, s')$. Fix $c$ and $w$, and assume that $c \in s'.Att$, $w \in s'.TotEst$, $w.id < c.id$, and there is no $x \in s'.TotEst$ such that $w.id < x.id < c.id$. We consider four cases:

1. $c \in s.Att$ and $w \in s.TotEst$.

The invariant follows from the inductive hypothesis.

2. $c \notin s.Att$ and $w \notin s.TotEst$.

This cannot happen because we cannot have both $c$ becoming attempted and $w$ becoming totally established in a single step.

3. $c \notin s.Att$ and $w \in s.TotEst$.

Then $\pi$ must be newconf$(c)_p$ for some $p$. The precondition of $\pi$ implies that, for any configuration $y \in s.use_p$, there exist $R \in y.rqrms$ and $W \in y.wqrms$ such that $R \cup W \subseteq c.set$. Hence to prove the claim it is enough to prove that $w \in s.use_p$. We proceed by contradiction, assuming that $w \notin s.use_p$.

By Invariant 6.4.3, Part 3, $s.use_p \cap s.TotEst \neq \emptyset$. Let $m$ be the configuration in $s.use_p \cap s.TotEst$ having the biggest identifier. We know that $m \neq w$ because $w \notin s.use_p$. Also, $m \neq c$, because $m \in s.TotEst$ and $c \notin s.TotEst$. It follows that $m.id \neq w.id$ and $m.id \neq c.id$.

We claim that $m.id < w.id$. We have already shown that $m.id \neq w.id$. Suppose for the sake of contradiction that $m.id > w.id$. From the precondition of action $\pi$ we have that $s.cur_p = c$ and hence $s.cur_p \neq \bot$. Also from the precondition of $\pi$ we have that $s.\text{client-cur}_p < s.cur_p$. Since $m \in s.use_p$, Invariant 6.4.4, Part 2, implies that $m.id < s.cur.id_p$, and since $s.cur_p = c$ we have that $m.id < c.id$. So $w.id < m.id < c.id$. Since $m \in s'.TotEst$, this contradicts the hypothesis of the inductive step. Therefore, $m.id < w.id$.

Let $n$ be the configuration in $s.TotEst$ that has the smallest identifier strictly greater than that of $m$. Remember that $w \in s'.TotEst$, and since $\pi =$ newconf$(c)_p$ we have that $w \in s.TotEst$; thus such an $n$ exists and satisfies $m.id < n.id \leq w.id$. Since $m \in s.use_p$, the precondition of $\pi$ implies that there exist $R \in m.rqrms$ and $W \in m.wqrms$ such that $R \cup W \subseteq c.set$. Since by inductive hypothesis the invariant is true in state $s$, there exist $R' \in m.rqrms$ and $W' \in m.wqrms$ such that $R' \cup W' \subseteq n.set$. By the properties of quorums there exists a process $q \in (R \cup W) \cap (R' \cup W')$, and thus $q \in n.set \cap c.set$. By the precondition of $\pi$, $s.\text{info-rcvd}[q, c.id]_p = (x, X)$ for some $x$, $X$. Then Invariant 6.4.1 implies that $s.\text{info-sent}[c.id]_q = (x, X)$, and Invariant 6.4.3 says that $x \in s.TotEst$. Then Invariant 6.4.4, Part 3 (used with $w = x$), implies that $x.id < c.id$. Since $n \in s.TotEst$, we have that $n \in s.attempted_q$. Then Invariant 6.4.2 (used with $w = n$) implies that either $n \in \{x\} \cup X$ or $n.id < x.id$. In either case, $\{x\} \cup X$ contains a configuration $y \in s.TotEst$ (either $n$ or $x$) such that $n.id \leq y.id < c.id$. Then Invariant 6.4.5 implies that either $y \in s.use_p$ or $y.id < s.act.id_p$. By Invariant 6.4.3, Part 1, $s.act_p \in s.TotEst$ and, by definition, $s.act_p \in s.use_p$. So in either case, the hypothesis that $m$ is the totally established configuration with the largest identifier belonging to $s.use_p$ is contradicted.

4. $c \in s.Att$ and $w \notin s.TotEst$.

Then $\pi$ must be newstate$(\cdot)_p$ for some $p$. Let $m$ be the configuration in $s.TotEst$ with the largest identifier that is strictly less than $w.id$. By the statement for $s$, we know that there exist $R' \in m.rqrms$ and $W' \in m.wqrms$ such that $R' \cup W' \subseteq w.set$, and there exist $R'' \in m.rqrms$ and $W'' \in m.wqrms$ such that $R'' \cup W'' \subseteq c.set$. Hence, by the properties of quorums, there is a process $q \in w.set \cap c.set$.

Since $c \in s.Att$, there exists a process $r$ such that $c \in s.attempted_r$. Thus also $c \in s'.attempted_r$. Since $w \in s'.TotEst$, we have that $w \in s'.attempted_q$. By assumption, there is no configuration $x \in s'.TotEst$ such that $w.id < x.id < c.id$. By Invariant 6.4.8 applied to state $s'$ (with $p = r$), there exist $R \in w.rqrms$ and $W \in w.wqrms$ such that $R \cup W \subseteq c.set$, as needed.

□

So far, the proof that DC-IMPL implements DC has been very similar to the proof that DVS-IMPL implements DVS. The following invariant is different from the corresponding one in the DVS implementation, Invariant 5.2.17. Here is where dead configurations are needed.

Invariant 6.4.10 (DC-IMPL)
In any reachable state, if $c, w \in Att$, $w.id < c.id$, configuration $w$ is not dead, and there is no $x \in TotEst$ with $w.id < x.id < c.id$, then there exist $R \in w.rqrms$ and $W \in w.wqrms$ such that $R \cup W \subseteq c.set$.

Proof: By induction on the length of an execution. The base case consists of proving that the invariant is true in the initial state. In the initial state, only $c_0$ is attempted, so the hypotheses cannot be satisfied. Thus, the statement is vacuously true.

For the inductive step assume the invariant is true in state $s$. We need to prove that it is true in $s'$ for any possible step $(s, \pi, s')$. Fix $c$ and $w$, and assume that $c \in s'.Att$, $w \in s'.Att$, $w.id < c.id$, $w \notin s'.dead$, and there is no $x \in s'.TotEst$ such that $w.id < x.id < c.id$. We consider four cases:

1. $w \in s.Att$ and $c \in s.Att$. Then the invariant is true by the inductive hypothesis.

2. $w \notin s.Att$ and $c \notin s.Att$. This is not possible because a single action cannot make both $w$ and $c$ attempted.

3. $w \notin s.Att$ and $c \in s.Att$. Then it must be that $\pi =$ newconf$(w)_{p'}$ for some process $p'$ that attempts $w$. Since $c \in s.Att$ we have that $c \in s'.Att$. Hence there exists $p$ such that $c \in s'.attempted_p$.

Now we claim that there must exist a process $q \in c.set \cap w.set$.

Clearly we have that $w \notin TotEst$. Let $Y = \{y \mid y \in TotEst,\ y.id < w.id\}$. We first show that $Y$ is nonempty. The initial configuration is totally established, thus $c_0 \in TotEst$; moreover $c_0.id \leq w.id$. If $c_0.id = w.id$, then we have $w = c_0$. But then $w \in TotEst$, a contradiction to the definition of this case. So we must have $c_0.id < w.id$, which implies that $c_0 \in Y$, so $Y$ is nonempty.

Now fix $z$ to be the configuration in $Y$ with the largest id. We have that there is no $x \in TotEst$ with $z.id < x.id < c.id$. Then Invariant 6.4.9 implies that there exist $R \in z.rqrms$ and $W \in z.wqrms$ such that $R \cup W \subseteq w.set$, and also that there exist $R' \in z.rqrms$ and $W' \in z.wqrms$ such that $R' \cup W' \subseteq c.set$. By the properties of quorums we have that $(R \cup W) \cap (R' \cup W') \neq \emptyset$. Hence there exists $q$ such that $q \in c.set \cap w.set$.

Now we claim that $w \in s'.attempted_q$. By contradiction, assume that $w \notin s'.attempted_q$. Since $c \in s'.attempted_p$ and $q \in c.set$, we have that $s'.\text{info-rcvd}[q, c.id]_p = (x, X)$ for some $x$ and $X$. By Invariant 6.4.1 we have that $s'.cur.id_q \geq c.id > w.id$. Since we assumed that $w \notin s'.attempted_q$, by definition of dead configuration we have that $w$ is dead. But this contradicts the hypothesis of the invariant, which states that $w$ is not dead. Hence it must be that $w \in s'.attempted_q$.

Hence we have that $c \in s'.attempted_p$, $q \in c.set$, $w \in s'.attempted_q$, $w.id < c.id$, and there is no $x \in s'.TotEst$ such that $w.id < x.id < c.id$. By Invariant 6.4.8 there exist $R \in w.rqrms$ and $W \in w.wqrms$ such that $R \cup W \subseteq c.set$.

4. $w \in s.Att$ and $c \notin s.Att$. Then it must be that $\pi =$ newconf$(c)_p$ for some process $p$ that attempts $c$. We have that $c \in s'.attempted_p$.

The rest of the proof is as in the previous case: it claims that there exists $q \in c.set \cap w.set$ and that $w \in s'.attempted_q$, and then uses Invariant 6.4.8 to get the needed conclusion.

□






6.4.3 Proof that DC-IMPL implements DC

We are now ready to prove that DC-IMPL implements DC. We first provide a function mapping states 
of DC-IMPL to states of DC. Then we will prove that this function is an abstraction function. 
The abstraction function is given in Figure 6-4. 



Let $s$ be a state of DC-IMPL. The state $t = \mathcal{F}_{DC}(s)$ of DC is the following.

1. $t.created = \bigcup_{p \in \mathcal{P}} s.attempted_p$

2. for each $p \in \mathcal{P}$, $t.\text{cur-cid}[p] = s.\text{client-cur}.id_p$

3. for each $g \in \mathcal{G}$, $t.attempted[g] = \{p \mid g = c.id,\ c \in s.attempted_p\}$

4. for each $g \in \mathcal{G}$, $t.\text{state-dlv}[g] = \{p \mid s.estb[g]_p\}$

5. for each $g \in \mathcal{G}$, $t.\text{got-state}[g](p) = s.\text{state-got}[g](p)_p$

6. for each $g \in \mathcal{G}$, $t.condenser[g](p) = s.cond[g]_p$

7. for each $g \in \mathcal{G}$ and $i \in OID_p$, $t.pending[g](i) = \bot$ if $s.pend[g](i)_p = \bot$; otherwise it is defined by:

• $msg = s.pend[g](i).msg$
• $cnd = s.pend[g](i).cnd_p$
• $sel = s.pend[g](i).sel_p$
• $dlv = \{q \mid s.\text{msg-dlvd}[g](i) = true\}$
• $ack = \{q \mid \langle a, i\rangle \in s.\text{to-cs}_q$ or $\langle a, i\rangle \in s.CS.pending[q, g]$ or $s.pend[g](i).ack(q)_p = a\}$
• $rsp = s.pend[g](i).rsp_p$

Figure 6-4: The abstraction function $\mathcal{F}_{DC}$.

In order to prove that $\mathcal{F}_{DC}$ is an abstraction function we need to prove that for any initial state $s$ of DC-IMPL, $\mathcal{F}_{DC}(s)$ is an initial state of DC, and that for any possible step $\pi$ of DC-IMPL there exists a sequence $\alpha$ of steps of DC such that the trace of $\alpha$, that is, the externally observable behavior, is equal to the trace of $\pi$. Lemmas 6.4.11 and 6.4.12 prove the above. The proofs of these lemmas are similar to the corresponding ones for DVS. The key difference is in the supporting invariant used in the proof, Invariant 6.4.10, which is crucial in proving the simulation relation for the case when the implementation executes action $\pi =$ newconf$(c)_p$, for some configuration $c$ and some process $p$.

Lemma 6.4.11 If $s$ is an initial state of DC-IMPL then $\mathcal{F}_{DC}(s)$ is an initial state of DC.

Proof: Let $s$ be the initial state of DC-IMPL. We have that $s.attempted_p$ is $\{c_0\}$ for $p \in P_0$ and $\emptyset$ for $p \notin P_0$. Hence, by definition of $\mathcal{F}_{DC}$, we have that $t.created = \{c_0\}$, which is as in the initial state of DC.

We have that $s.\text{client-cur}.id_p$ is $g_0$ for $p \in P_0$ and $\bot$ for $p \notin P_0$. Hence, by definition of $\mathcal{F}_{DC}$, we have that $t.\text{cur-cid}[p]$ is $g_0$ for $p \in P_0$ and $\bot$ for $p \notin P_0$. This is as in the initial state of DC.

We have that $s.estb[g]_p$ is true for $p \in P_0$, $g = g_0$ and false otherwise. Hence we have that $t.\text{state-dlv}[g]$ is $P_0$ if $g = g_0$ and $\emptyset$ if $g \neq g_0$. This is as in the initial state of DC.

We have that $s.cond[g]_p = \bot$ for all $g$ and $p$. Hence $t.condenser[g](p) = \bot$ for all $g$ and $p$, which is as in the initial state of DC.

We have that $pend[g](i) = \bot$ for all $g$ and $i$. Hence $t.pending[g](i) = \bot$ everywhere, which is as in the initial state of DC.

Hence if $s$ is an initial state of DC-IMPL, we have that $\mathcal{F}_{DC}(s)$ is an initial state of DC. □

Lemma 6.4.12 Let $s$ be a reachable state of DC-IMPL, $\mathcal{F}_{DC}(s)$ a reachable state of DC, and $(s, \pi, s')$ a step of DC-IMPL. Then there is an execution fragment $\alpha$ of DC that goes from $\mathcal{F}_{DC}(s)$ to $\mathcal{F}_{DC}(s')$, such that $trace(\alpha) = trace(\pi)$.

Proof: By case analysis based on the type of the action $\pi$. (The only interesting case is where $\pi =$ newconf$(c)_p$.) Define $t = \mathcal{F}_{DC}(s)$ and $t' = \mathcal{F}_{DC}(s')$.

1. $\pi =$ cs-createconf$(c)$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ modifies $created$. The definition of $\mathcal{F}_{DC}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.

2. $\pi =$ cs-newconf$(c)_p$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ modifies $\text{current-confid}[p]$, $cur_p$ and $\text{info-sent}[cur.id]_p$, and adds an "info" message to $\text{to-cs}[cur.id]_p$. The definition of $\mathcal{F}_{DC}$ is not sensitive to any of these changes. Therefore, $t = t'$, and we set $\alpha = t$.

3. $\pi =$ cs-gpsnd$(m)_p$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ just moves a message from the queue $\text{to-cs}[cur.id]_p$ to the queue $CS.pending[p, \text{current-confid}[p]]$. The definition of $\mathcal{F}_{DC}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.

4. $\pi =$ cs-order$(m, p, g)$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ moves a message from $CS.pending[p, g]$ to $CS.queue[g]$. The definition of $\mathcal{F}_{DC}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.

5. $\pi =$ cs-gprcv$(\langle\text{"info"}, c, C\rangle)_{q,p}$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ increments the $next$ pointer in CS, and may modify $act$ and $amb$ in CS-TO-DC. The definition of $\mathcal{F}_{DC}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.

6. $\pi =$ cs-safe$(\langle\text{"info"}, c, C\rangle)_{q,p}$

Then $trace((s, \pi, s')) = \lambda$. Action $\pi$ increments the $\text{next-safe}$ pointer in CS. The definition of $\mathcal{F}_{DC}$ is not sensitive to this change. Therefore, $t = t'$, and we set $\alpha = t$.

7. π = newconf(c)_p

Then trace((s, π, s')) = π. In DC-IMPL, this action modifies only variables amb_p, attempted_p,
client-cur_p. We have s'.client-cur_p = c and s'.attempted_p = s.attempted_p ∪ {c}. By definition
of F_DC, we have that t'.cur-cid[p] = s'.client-cur.id_p, t'.created = ∪_{p∈P} s'.attempted_p and
t'.attempted[c.id] = {p | c ∈ s'.attempted_p}. Hence we have that t'.cur-cid[p] = c.id, t'.created =
t.created ∪ {c} and t'.attempted[c.id] = t.attempted[c.id] ∪ {p}, while all other state variables
in t' are as in t.

We consider two cases:

(a) c ∈ t.created.

In this case, we set α = (t, π', t'), where π' = newconf(c)_p. The code shows that π' brings
DC from state t to state t'. It remains to prove that π' is enabled in state t, that is, that
c ∈ t.created and c.id > t.cur-cid[p]. The first of these two conditions is true because of
the defining condition for this case. The second condition follows from the precondition of
π in DC-IMPL: this precondition implies that c.id > s.client-cur.id_p, and by the definition
of F_DC we have t.cur-cid[p] = s.client-cur.id_p.

(b) c ∉ t.created.

In this case we set α = (t, π', t'', π'', t'), where π' = createconf(c)_p, π'' = newconf(c)_p,
and t'' is the unique state that arises by running the effect of π' from t. The code shows
that α brings DC from state t to state t'. It remains to prove that π' is enabled in t and
that π'' is enabled in t''.

• Action π'. We start by proving that π' is enabled in t.

The precondition of π' requires that (i) ∀w ∈ t.created, c.id ≠ w.id and (ii) if c is not
dead, the following two conditions, C1 and C2, are true.

C1: ∀w ∈ t.created, w.id < c.id, either w is dead or ∃x ∈ t.TotEst satisfying w.id <
x.id < c.id or there exist R ∈ w.rqrms and W ∈ w.wqrms such that R ∪ W ⊆ c.set;
C2: ∀w ∈ t.created, c.id < w.id, either w is dead or ∃x ∈ t.TotEst satisfying c.id <
x.id < w.id or there exist R ∈ c.rqrms and W ∈ c.wqrms such that R ∪ W ⊆ w.set.

- requirement (i). To see requirement (i), suppose for the sake of contradiction
that there exists w ∈ t.created such that w.id = c.id. The precondition of π
in DC-IMPL implies that c = s.cur_p, which implies that c ∈ s.CS.created. Since
w ∈ t.created, the definition of F_DC implies that w ∈ s.attempted_q for some q. This
implies that w ∈ s.CS.created. Hence both c and w belong to s.CS.created, and
since w.id = c.id we have that c = w. But this is impossible since
c ∉ t.created and w ∈ t.created.

- requirement (ii). If c is dead, then requirement (ii) is trivially satisfied. Hence
assume that c is not dead. We have to show that both C1 and C2 are true.
Let us start with C1. Assume that there exists w ∈ t.created such that w.id <
c.id, that w is not dead and that there is no x ∈ t.TotEst satisfying w.id < x.id <
c.id (otherwise C1 is true and we are done). Since w ∈ t.created, by definition
of F_DC, w ∈ s.attempted_q for some q. Clearly, w ∈ s'.attempted_q. Therefore,
w ∈ s'.Att. By the code of π we have that c ∈ s'.attempted_p. Therefore we also
have c ∈ s'.Att. Moreover, there is no x ∈ s'.TotEst satisfying w.id < x.id < c.id
(this is true in t and thus, by definition of F_DC, is true in s and, because π does
not establish any configuration, it stays true in s'). By Invariant 6.4.10 we have
that there exist R ∈ w.rqrms and W ∈ w.wqrms such that R ∪ W ⊆ c.set, as
needed to prove C1.

We look now at C2. Assume that there exists w ∈ t.created such that c.id < w.id,
that w is not dead and that there is no x ∈ t.TotEst satisfying c.id < x.id < w.id.
We already know that c is not dead. Since w ∈ t.created, by definition of F_DC,
w ∈ s.attempted_q for some q. Clearly, w ∈ s'.attempted_q. Therefore, w ∈ s'.Att.
By the code of π we have that c ∈ s'.attempted_p. Therefore we also have c ∈ s'.Att.
Moreover, there is no x ∈ s'.TotEst satisfying c.id < x.id < w.id. By Invariant 6.4.10
we have that there exist R ∈ c.rqrms and W ∈ c.wqrms such that R ∪ W ⊆ w.set, as
needed to prove C2.

This proves that π' is enabled in t.

• Action π''. Now we prove that π'' is enabled in state t''.

The precondition of π'' requires that c ∈ t''.created and c.id > t''.cur-cid[p]. The first
condition is true because c is added to created by π'. The second condition follows
from the precondition of π in DC-IMPL: the precondition of π implies that c.id >
s.client-cur.id_p. The definition of F_DC implies that t.cur-cid[p] = s.client-cur.id_p.
Moreover, t''.cur-cid[p] = t.cur-cid[p]. It follows that c.id > t''.cur-cid[p]. Thus π''
is enabled in state t''.

8. π = submit-state(o, φ)_p

Then trace((s, π, s')) = π. This action sets state-got[g](p)_p := o, cond[g]_p := φ and appends
⟨"state-got", o⟩ to to-cs[g]_p, where g = client-cur_p. By the definition of F_DC we have that
t'.got-state[g](p) = o and t'.condenser[g](p) = φ, while all other state variables are as in t. We
set α = (t, π, t'). The code shows that π actually brings DC from t to t'. Moreover π is an
input action, so it is always enabled.

9. π = cs-gprcv(⟨"state-got", o⟩)_{q,p}

Then trace((s, π, s')) = λ. Action π sets state-got[g](q)_p := o. The definition of F_DC is not
sensitive to this change. Therefore, t = t', and we set α = t.

10. π = cs-safe(⟨"state-got", o⟩)_{q,p}

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of
F_DC is not sensitive to this change. Therefore, t = t', and we set α = t.

11. π = newstate(o)_p

Then trace((s, π, s')) = π. This action sets estb[g] := true and appends the message "established"
to to-cs[g], where g = cur.id_p. By definition of F_DC we have that t'.state-dlv[g] = t.state-dlv[g] ∪
{p} and this is the only difference between t' and t. We set α = (t, π, t'). The code shows that
π actually brings DC from t to t'. It remains to prove that π is enabled in t. The preconditions
of π in DC-IMPL are the same as those in DC. Thus π is enabled in DC because it is enabled in
DC-IMPL.

12. π = cs-gprcv("established")_{q,p}

Then trace((s, π, s')) = λ. Action π sets rcvd-estb[q, cur.id_p]_p := true. The definition of F_DC
is not sensitive to this change. Therefore, t = t', and we set α = t.

13. π = cs-safe("established")_p

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of
F_DC is not sensitive to this change. Therefore, t = t', and we set α = t.

14. π = garbage-collect(c)_p

Then trace((s, π, s')) = λ. This action can modify act_p and amb_p. The definition of F_DC is not
sensitive to these changes. Therefore, t = t', and we set α = t.

15. π = SUBMIT(m, φ, b, i)_p

Then trace((s, π, s')) = π. This action sets pend[g] := (m, φ, b, ∅, ⊥, false) and appends ⟨m, i⟩
to to-cs[g], where g = client-cur.id_p ≠ ⊥. The definition of F_DC shows that this changes the
queue of pending operations pending so that t'.pending[g](i) = (m, φ, b, ∅, ⊥, false) and this
is the only difference between t and t'. We set α = (t, π, t'). The code shows that π actually
brings DC from t to t'. Moreover π is an input action, so it is always enabled.

16. π = CS-GPRCV(m, i)_p

Then trace((s, π, s')) = λ. Action π appends (m, i) to dlv-queue[cur.id_p]_p. The definition of
F_DC is not sensitive to this change. Therefore, t = t', and we set α = t.

17. π = CS-SAFE(m, i)_p

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of
F_DC is not sensitive to this change. Therefore, t = t', and we set α = t.

18. π = DELIVER(m, i)_p

Then trace((s, π, s')) = π. This action deletes the head of dlv-queue[g]_p and sets msg-dlvd[g](i) :=
true, where g = cur.id_p. By the definition of F_DC we have that the only thing that changes
from t to t' is pending[g](i).dlv, that is t'.pending[g](i).dlv = t.pending[g](i).dlv ∪ {p}. We set
α = (t, π, t'). The code shows that π actually brings DC from t to t'. It remains to prove that
π is enabled in t. From the precondition of π we have that s.msg-dlvd[g](i)_p = false and thus
we have that p ∉ t.pending[g](i).dlv. From the precondition of π we have that (m, i) is the
head of dlv-queue[g]_p. By Invariant 6.4.7 we have that t.pending[g](i).msg_p = m. Hence π is
enabled in state t of DC.

19. π = ACK-DLVR(a, i)_p

Then trace((s, π, s')) = π. This action appends (a, i) to to-cs[g]_p where g = client-cur.id_p. By
the definition of F_DC we have that the only thing that changes from t to t' is pending[g](i).ack,
that is t'.pending[g](i).ack = t.pending[g](i).ack ∪ {p}. We set α = (t, π, t'). The code shows
that π actually brings DC from t to t'. Moreover π is enabled in t because it is an input action.

20. π = CS-GPRCV(a, i)_p

Then trace((s, π, s')) = λ. Action π sets (m, i) to dlv-queue[cur.id_p]_p. The definition of F_DC is
not sensitive to this change. Therefore, t = t', and we set α = t.

21. π = CS-SAFE(a, i)_p

Then trace((s, π, s')) = λ. Action π increments the next-safe pointer in CS. The definition of
F_DC is not sensitive to this change. Therefore, t = t', and we set α = t.

22. π = RESPOND(r, i)_p

Then trace((s, π, s')) = π. This action sets pend[g](i).rsp := true. By the definition of F_DC we
have that t'.pending[g](i).rsp = true. We set α = (t, π, t'). The code shows that π actually
brings DC from t to t'. It remains to prove that π is enabled in t. The precondition of π in
DC-IMPL is the same as the precondition of π in DC. Hence π is enabled in DC because it is
enabled in DC-IMPL.

□

Lemmas 6.4.11 and 6.4.12 prove that F_DC is an abstraction function from DC-IMPL to DC and thus
the following theorem holds.

Theorem 6.4.13 Every trace of DC-IMPL is a trace of DC.

6.5 An application of DC

In this section we show how to use DC to implement an atomic multi-writer multi-reader shared 
register. The algorithm is an extension of the single-writer multi-reader atomic register of Attiya, 
Bar-Noy and Dolev [12]. A similar extension was provided in [66]. 

6.5.1 Overview 

In this section we provide a description of the algorithm and the code. We start with the description 
of the algorithm. 

Each process keeps a copy of the shared register in variable val, paired with a tag in variable
tag. Tags are used to establish the time when values are written: a value paired with a bigger tag
has been written after a value paired with a smaller tag. Tags consist of pairs ⟨j, p⟩ where j is a
sequence number (a non-negative integer) and p is a process identifier. Tags are ordered according
to their sequence numbers, with process identifiers breaking ties. Given a tag t = ⟨j, p⟩, the notation
t.seq denotes the sequence number j.
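
The tag order described above can be illustrated with a small Python sketch. The names Tag and next_write_tag are illustrative assumptions, not part of the thesis code, and process identifiers are assumed to be integers so that they can break ties.

```python
from typing import NamedTuple

class Tag(NamedTuple):
    seq: int  # sequence number j, a non-negative integer
    pid: int  # process identifier p (assumed here to be an integer)

def next_write_tag(max_seen: Tag, writer: int) -> Tag:
    """Hypothetical helper: the tag a writer would pair with a new value,
    one sequence number above the largest tag it has seen."""
    return Tag(max_seen.seq + 1, writer)

# NamedTuple instances compare lexicographically: first by seq, then by pid,
# which is the order on tags described in the text.
a = Tag(3, 1)
b = Tag(3, 2)   # same sequence number, larger process id
c = Tag(4, 0)   # larger sequence number wins regardless of process id
largest = max([a, b, c])
fresh = next_write_tag(largest, writer=7)
```

Because the process identifier is part of the tag, two different writers can never produce the same tag, even if they pick the same sequence number.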

The algorithm has two modes of operation: a normal mode and a reconfiguration mode. The latter 
is used to establish a new configuration. It is entered when a new configuration is announced (action 
newconf) and is left when the configuration becomes established (action newstate). The former is 
the mode where read and write operations are performed and it is entered when a configuration is 
established and is left when a new configuration is announced. During the reconfiguration mode 
pending operations are delayed until the normal mode is restored. Variable conf-status is used to 
keep track of the mode (values exch-ready, exch-wait are for the reconfiguration mode). 

Clients of the service can request read and write operations by means of actions read_p and
write(x)_p. We assume that each client does not invoke a new operation request before receiving the
response for the previous request. Both types of requests (read and write) are handled in a similar
way: there is a query phase and a subsequent propagate phase. During the query phase the server
receiving the request "queries" a read-quorum in order to get the value of the shared register and
the corresponding tag for each of the members of the read-quorum. From these it selects the value
x corresponding to the max tag t. This concludes the query phase. In the propagate phase the
server sends a new value and a new tag (which are (x, t) for the case of a read_p operation and
(y, (t.seq + 1, p)) for a write(y)_p operation) to the members of a write-quorum. These processes
update their own copy of the register if the tag received is greater than their current tag; then they
send back an acknowledgment to the server p. When p gets the acknowledgment message from the
members of a write-quorum, the propagate phase is completed. At this point the server can respond
to the client that issued the operation with either the value read, in the case of a read operation, or
with just a confirmation, in the case of a write operation.

We remark that when a configuration change happens during the execution of a requested
operation, the completion of the operation is delayed until the normal mode is restored. However, if the
query phase has already been completed it is not necessary to repeat it in the new configuration.
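
The query and propagate phases described above can be sketched in Python as follows. This is only an illustration under strong simplifying assumptions: replicas live in one in-memory dictionary, quorums are passed in explicitly, and there are no failures, no message passing and no reconfiguration. The function names and the dictionary layout are hypothetical and do not come from the thesis code.

```python
# Tags are (sequence number, process id) pairs; Python compares tuples
# lexicographically, which matches the order on tags.

def query(replicas, read_quorum):
    """Query phase: return the (value, tag) pair with the maximum tag in the quorum."""
    best = max(read_quorum, key=lambda q: replicas[q]["tag"])
    return replicas[best]["val"], replicas[best]["tag"]

def propagate(replicas, write_quorum, val, tag):
    """Propagate phase: a member adopts (val, tag) only if tag is greater than
    its own tag; every member acknowledges. Returns the set of acknowledgers."""
    acks = set()
    for q in write_quorum:
        if tag > replicas[q]["tag"]:
            replicas[q]["val"] = val
            replicas[q]["tag"] = tag
        acks.add(q)
    return acks

def read(replicas, read_quorum, write_quorum):
    x, t = query(replicas, read_quorum)
    propagate(replicas, write_quorum, x, t)  # a read writes back (x, t) unchanged
    return x

def write(replicas, p, y, read_quorum, write_quorum):
    _, t = query(replicas, read_quorum)
    propagate(replicas, write_quorum, y, (t[0] + 1, p))  # new tag (t.seq + 1, p)

# Three replicas, each initially holding a default value with tag (0, p).
replicas = {p: {"val": "x0", "tag": (0, p)} for p in (0, 1, 2)}
write(replicas, 1, "a", read_quorum={0, 1}, write_quorum={1, 2})
result = read(replicas, read_quorum={0, 2}, write_quorum={0, 1})
```

In this run the read quorum {0, 2} intersects the write quorum {1, 2} used by the write, so the read sees the tag (1, 1) at replica 2 and returns the written value; its own propagate phase then brings replica 0 up to date.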

We denote by T = {⟨j, p⟩ | j ∈ ℕ, p ∈ P} the set of tags. This set is ordered according to the first
component, with the process identifiers breaking ties. We denote by X the set of values that the
shared register can assume. We assume that there is a default value x_0 ∈ X. Initially all members
of c_0 have the shared copy of the register set to x_0.

All other data types used in the code are as defined for DC. 

Figures 6-5 and 6-6 provide the code of ABD-CODE_p; in this code, we use the following
condenser functions:

- φ_maxtag, which computes the value and the tag of the max tag register copy. Formally this
function takes a collection of tuples Z = {("query-ack", v, t, h)}, with t ∈ T, and returns one
such quadruple which has a maximum tag t among the elements of Z together with the set Q
of processes that submitted the tuples; formally it returns a tuple ("query-ack", v, t, h, Q).

- φ_ack, which returns an acknowledgment. Formally this function takes a collection of pairs
Z = {("prop", t)}, all of which have the same value t ∈ T, and returns ("prop-ack", t, Q),
where Q is the set of processes that submitted the pairs.

- φ_state, which computes the up-to-date state for a new configuration. Formally it takes a
collection of triples Z = {(x, t, h)} where x is a value, t a tag and h a configuration identifier,
considers the subset Z' of the triples of Z with maximum h and returns the first two components
of a triple of Z' which has the maximum t in Z'. Such a triple can be picked by default, choosing
e.g., the one that came from the process with the smallest identifier.
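
A minimal Python sketch of the three condenser functions follows. It assumes that the collection Z is represented as a dictionary from submitting process identifier to the submitted tuple, that tags are (sequence number, process id) pairs, and that configuration identifiers are integers; these representations and the function names are assumptions made for illustration only.

```python
def phi_maxtag(Z):
    """Z maps process id -> ("query-ack", v, t, h).
    Returns ("query-ack", v, t, h, Q) for a tuple with maximum tag t."""
    Q = frozenset(Z)
    _, v, t, h = max(Z.values(), key=lambda m: m[2])
    return ("query-ack", v, t, h, Q)

def phi_ack(Z):
    """Z maps process id -> ("prop", t), all with the same tag t.
    Returns ("prop-ack", t, Q)."""
    tags = {m[1] for m in Z.values()}
    if len(tags) != 1:
        raise ValueError("all submitted pairs must carry the same tag")
    return ("prop-ack", tags.pop(), frozenset(Z))

def phi_state(Z):
    """Z maps process id -> (x, t, h).
    Keep the triples with maximum h, then return (x, t) of one with maximum t;
    ties are broken by the smallest process identifier."""
    max_h = max(h for (_, _, h) in Z.values())
    candidates = {p: s for p, s in Z.items() if s[2] == max_h}
    max_t = max(s[1] for s in candidates.values())
    p = min(p for p, s in candidates.items() if s[1] == max_t)
    x, t, _ = Z[p]
    return (x, t)

# Process 2 holds the largest tag, but its state comes from an older
# configuration (h = 4), so phi_state ignores it.
states = {1: ("a", (2, 1), 5), 2: ("b", (9, 2), 4), 3: ("c", (1, 3), 5)}
chosen = phi_state(states)
```

The example shows why φ_state filters on h before comparing tags: a triple with a large tag but a smaller h is not a candidate.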



ABD-CODE (signature and state)

Signature:

Input:  READ_p, p ∈ P
        WRITE(x)_p, x ∈ X, p ∈ P
        DELIVER(m, i)_p, m ∈ M, i ∈ OID, p ∈ P
        RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
        NEWCONF(c)_p, c ∈ C, p ∈ c.set
        NEWSTATE(s)_p, s ∈ S,

Output: SUBMIT(m, φ, b, i)_p, m ∈ M, φ ∈ Φ, b ∈ {"r", "w"}, p ∈ P, i ∈ OID_p
        ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
        READ-CONFIRM(x)_p, x ∈ X, p ∈ P
        WRITE-CONFIRM_p, p ∈ P
        SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ,

State:

current ∈ C_⊥, init c_0 if p ∈ P_0 else ⊥
high ∈ G_⊥, init g_0 if p ∈ P_0 else ⊥
val ∈ X, initially x_0
tag ∈ T, initially (0, p)
prop-val ∈ X, initially x_0
prop-tag ∈ T, initially (0, p)
conf-status ∈ {normal, exch-ready, exch-wait}, init normal
status ∈ {query-ready, query-wait, prop-ready, prop-wait, prop-done}, init query-ready
request ∈ {"read", ("write", x), ⊥}, init ⊥
ack-q ∈ seqof(A × OID), init λ

Figure 6-5: The ABD interface and state.



We define the system ABD-SYS as the composition of DC and ABD-CODE_p for each p ∈ P.

ABD-CODE (transitions)

Actions:

input READ_p
Eff: request := "read"

input WRITE(x)_p
Eff: request := ("write", x)

output SUBMIT("query", φ_maxtag, "r", i)_p
Pre: current ≠ ⊥
     i ∉ used-ids
     conf-status = normal
     status = query-ready
     request ≠ ⊥
Eff: used-ids := used-ids ∪ {i}
     status := query-wait

input RESPOND(("query-ack", x, t, h, Q), i)_p
Eff: if request = "read" then
        prop-tag := t
        prop-val := x
     if request = ("write", y) then
        prop-tag := (t.seq + 1, p)
        prop-val := y
     status := prop-

output SUBMIT(("prop", x, t), φ_ack, "w", i)_p
Pre: current ≠ ⊥
     i ∉ used-ids
     conf-status = normal
     status = prop-ready
     x = prop-val
     t = prop-tag
Eff: status := prop-wait

input RESPOND(("prop-ack", t, Q), i)_p
Eff: status := prop-done

output READ-CONFIRM(x)_p
Pre: conf-status = normal
     status = prop-done
     request = "read"
     x = prop-val
Eff: request := ⊥
     status := query-ready

output WRITE-CONFIRM_p choose x
Pre: conf-status = normal
     status = prop-done
     request = ("write", x)
Eff: request := ⊥
     status := query-ready

input DELIVER("query", i)_p
Eff: append (("query-ack", val, tag, high), i) to ack-q

input DELIVER(("prop", x, t), i)_p
Eff: if t > tag then
        val := x
        tag := t
     append (("prop-ack", t), i) to ack-q

output ACKDLVR(a, i)_p
Pre: head(ack-q) = (a, i)
Eff: ack-q := tail(ack-q)

input NEWCONF(c)_p
Eff: current := c
     conf-status := exch-ready
     if status = query-wait then
        status := query-ready
     if status = prop-wait then
        status := prop-ready
     ack-q := λ

output SUBMIT-STATE((x, t, h), φ_state)_p
Pre: conf-status = exch-ready
     x = val
     t = tag
     h = high
Eff: conf-status := exch-wait

input NEWSTATE(x, t)_p
Eff: conf-status := normal
     if t > tag then
        val := x
        tag := t
     high := current

Figure 6-6: The ABD-CODE transitions.

6.5.2 Proof that ABD-SYS is an atomic register

In this section we prove that ABD-SYS implements an atomic read/write shared register. This proof 
uses an approach similar to that used in Chapter 5 and in [41] to prove the correctness of applications 
built on top of DVS and VS, respectively. 

We need the following history variables. 

For a process p and a configuration identifier g, variable buildtag[p, g] ∈ T_⊥ is defined as follows:

• If current.id_p = g then buildtag[p, g] = tag_p;

• If current.id_p > g then buildtag[p, g] is the value of tag_p at the moment when process p leaves
configuration g;

• If current.id_p < g then buildtag[p, g] = ⊥.

Informally, buildtag[p, g] records the value of the latest tag, if any, used in configuration g by
process p.

The value of buildtag[p, g] can be easily computed by following each statement that modifies tag_p in
actions deliver and newstate with another statement buildtag[p, current.id_p] := tag_p. It should be
clear that (i) when p is in configuration g we have that buildtag[p, g] = tag_p and (ii) after p leaves
configuration g, buildtag[p, g] forever contains the value of tag_p at the point when p left
configuration g.
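
As an illustration of how such a history variable can be maintained, the following Python sketch shadows every assignment to the tag with an assignment to buildtag for the current configuration. The class and method names are hypothetical. The sketch also records the tag when a new configuration is entered, which is an assumption added here so that property (i) holds immediately on entry; None stands for ⊥, and a configuration that p never entered is also reported as None.

```python
class ProcessHistory:
    """Tracks tag_p together with the history variable buildtag[p, g]."""

    def __init__(self, initial_conf_id, initial_tag):
        self.current_id = initial_conf_id
        self.tag = initial_tag
        self.buildtag = {initial_conf_id: initial_tag}

    def set_tag(self, new_tag):
        # Called wherever deliver or newstate modifies tag_p; the second
        # assignment is the statement buildtag[p, current.id_p] := tag_p.
        self.tag = new_tag
        self.buildtag[self.current_id] = new_tag

    def newconf(self, conf_id):
        # Entering a new configuration freezes buildtag for the old one.
        self.current_id = conf_id
        self.buildtag[conf_id] = self.tag  # assumption: record on entry

    def get_buildtag(self, g):
        return self.buildtag.get(g)  # None plays the role of ⊥

p = ProcessHistory(initial_conf_id=0, initial_tag=(0, 1))
p.set_tag((3, 2))      # tag changes while p is in configuration 0
p.newconf(4)           # p leaves configuration 0 and enters configuration 4
p.set_tag((5, 1))      # later changes no longer affect buildtag for 0
```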

We also need history variables to record the beginning and the end of query and propagate
phases of each operation, the configurations where the phases are executed, the quorum of processes
involved, as well as the tags returned by the query phases and the ones written by propagate phases.
We define the following history variables:

• query-begin[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true
when action SUBMIT("query", φ_maxtag, "r", i)_p is executed by some process p. Informally, when
query-begin[i] = true the query phase of operation i has started.

• query-end[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true
when action RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p and for any value
of x, t, h and Q. Informally, when query-end[i] = true the query phase of operation i has been
completed.

• query-quorum[i] ∈ 2^P_⊥, initially ⊥ for all i ∈ OID. This variable is set to Q when action
RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p and for any value of x, t and h.
Informally, query-quorum[i] records the read quorum used by the query phase of operation i.

• query-conf[i] ∈ C_⊥, initially ⊥ for all i ∈ OID. This variable is set to c when action
RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p when current_p = c, and for
any value of x, t, h and Q. Informally, variable query-conf[i] records the configuration in which
the query phase of operation i is performed.

• query-tag[i] ∈ T_⊥, initially ⊥ for all i ∈ OID. This variable is set to t when action
RESPOND(("query-ack", x, t, h, Q), i)_p is executed by some process p and for any value of x, h
and Q. Informally, variable query-tag[i] records the tag returned by the query phase of operation i.

• prop-begin[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true
when action SUBMIT(("prop", x, t), φ_ack, "w", i)_p is executed by some process p and for any value
of x and t. Informally, when prop-begin[i] = true the propagate phase of operation i has
started.

• prop-end[i] ∈ {true, false}, initially false for all i ∈ OID. This variable is set to true when
action RESPOND(("prop-ack", t, Q), i)_p is executed by some process p and for any value of t and Q.
Informally, when prop-end[i] = true the propagate phase of operation i has been completed.

• prop-quorum[i] ∈ 2^P_⊥, initially ⊥ for all i ∈ OID. This variable is set to Q when action
RESPOND(("prop-ack", t, Q), i)_p is executed by some process p and for any value of t. Informally,
prop-quorum[i] records the write quorum used by the propagate phase of operation i.

• prop-conf[i] ∈ C_⊥, initially ⊥ for all i ∈ OID. This variable is set to c when action
RESPOND(("prop-ack", t, Q), i)_p is executed by some process p when current_p = c and for any
value of t and Q. Informally, variable prop-conf[i] records the configuration in which the
propagate phase of operation i is performed.

• prop-tag[i] ∈ T_⊥, initially ⊥ for all i ∈ OID. This variable is set to t when action
RESPOND(("prop-ack", t, Q), i)_p is executed by some process p for any value of Q. Informally,
variable prop-tag[i] records the tag written during the propagate phase of operation i; this is
also the tag associated with operation i.

We define the set of summaries as the set Sum = {(x, t, g) | x ∈ X, t ∈ T, g ∈ G}. Given a
summary y ∈ Sum, y = (x, t, g), selectors for the components are y.value = x, y.tag = t and
y.high = g.

Informally a summary y = (x, t, g) is used to record that some process p has been in a state in
which val_p = x, tag_p = t and high_p = g.

We write allstate[p, g] to denote a set of summaries defined so that (x, t, h) ∈ allstate[p, g] if one
of the following holds.

1. current_p = g and val_p = x, tag_p = t and high_p = h.

2. got-state[g](p) = (x, t, h).

3. Message ("query-ack", x, t, h) is somewhere in the message mechanism of DC, more formally:

• (("query-ack", x, t, h), i) ∈ ack-q_p, for some operation identifier i;

• pending[g](i).ack(p) = ("query-ack", x, t, h), for some operation identifier i.

4. pending[g](i).msg_p = ("prop", x, t, h), for some operation identifier i.

We write allstate[g] to denote ∪_{p∈P} allstate[p, g], and allstate to denote ∪_{g∈G} allstate[g].

If Y is a partial function from process identifiers to summaries, then we define: maxprimary(Y) =
max_{q ∈ dom(Y)} {Y(q).high}, reps(Y) = {q ∈ dom(Y) : Y(q).high = maxprimary(Y)} and chosenrep(Y)
denotes some element q' in reps(Y) that maximizes Y(q').tag.
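
These three definitions translate directly into code. In the Python sketch below a partial function Y is modeled as a dictionary, configuration identifiers are assumed to be integers, and tags are (sequence number, process id) pairs. Since the text allows chosenrep(Y) to be any maximizer, the tie-breaking rule (smallest process identifier) is an assumption of this sketch.

```python
from typing import NamedTuple, Tuple

class Summary(NamedTuple):
    value: str
    tag: Tuple[int, int]
    high: int

def maxprimary(Y):
    """Largest high component among the summaries in Y."""
    return max(y.high for y in Y.values())

def reps(Y):
    """Processes whose summary has high equal to maxprimary(Y)."""
    m = maxprimary(Y)
    return {q for q, y in Y.items() if y.high == m}

def chosenrep(Y):
    """A member of reps(Y) with maximum tag; max returns the first maximal
    element, so iterating in sorted order picks the smallest process id."""
    return max(sorted(reps(Y)), key=lambda q: Y[q].tag)

Y = {
    1: Summary("a", (2, 1), 5),
    2: Summary("b", (9, 2), 4),   # largest tag, but smaller high
    3: Summary("c", (3, 3), 5),
}
rep = chosenrep(Y)
```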

Next we provide some preliminary invariants. Since these invariants state simple facts and also 
some of them are very similar to the ones used in [41], we provide operational proofs instead of 
formal assertional proofs. 

Invariant 6.5.1 (ABD-SYS)

In any reachable state, we have that current.id_p = cur-cid_p.

Proof: Both variables are initially g_0 for p ∈ P_0 and ⊥ for p ∉ P_0. Both variables are set to c.id
when action newconf(c)_p is executed. □

Invariant 6.5.2 (ABD-SYS)

In any reachable state, if query-end[i] = true or prop-end[i] = true, then we have that
prop-conf[i] ∈ created \ dead.

Proof: By definition of prop-conf, query-end and prop-end we have that if query-end[i] = true or
prop-end[i] = true, then prop-conf[i] ≠ ⊥. Let c = prop-conf[i]. Clearly c must be created.
The query phase and the propagate phase are executed in normal processing, that is, when
conf-status = normal for each process involved. Thus processes participating in the query or in
the propagate phase have executed action newstate for configuration c. Such an action is executed
only when all members of c have submitted their state to the DC service. In order to submit their
state for configuration c, each member must have executed action newconf(c). Hence c is not dead. □

Invariant 6.5.3 (ABD-SYS)

In any reachable state the following are true.

1. high_p ∈ TotAtt

2. If y ∈ allstate then y.high ∈ TotAtt.

Proof Sketch: For any summary y, configuration y.high is a configuration which has been the high_p
for some process p. Hence it suffices to prove Part 1. Variable high_p is set only by action newstate(·)_p.
This action is executed only if all the members of the configuration have submitted their state to
the DC service. This implies that all the members of the configuration have attempted high_p. □

Invariant 6.5.4 (ABD-SYS)
In any reachable state:

1. high_p ∉ dead

2. If y ∈ allstate then y.high ∉ dead.

Proof Sketch: This invariant follows easily from the previous one since a totally attempted
configuration cannot be dead. □

Invariant 6.5.5 (ABD-SYS)

In any reachable state, the following is true: If c ∈ created \ dead and ∃y ∈ allstate such that
y.high > c.id then there exists R ∈ c.rqrms such that for all p ∈ R, current.id_p > c.id.

Proof: Let configuration c and y ∈ allstate be such that c ∈ created \ dead and y.high > c.id.
By Invariant 6.5.3 we have that y.high ∈ TotAtt. Then, by Invariant 6.3.3, we have that there
exists R ∈ c.rqrms such that for all p ∈ R, cur-cid_p > c.id. By Invariant 6.5.1 we have that
current.id_p = cur-cid_p. It follows that there exists R ∈ c.rqrms such that for all p ∈ R,
current.id_p > c.id. □

Invariant 6.5.6 (ABD-SYS)

In any reachable state, the following is true. Let c, w be two configurations such that c.id < w.id
and t ∈ T. Let p ∈ c.set ∩ w.set and let y = got-state[w.id](p). If buildtag[p, c.id] > t, y ≠ ⊥ and
y.high > c.id, then y.tag > t.

Proof Sketch: Since y = got-state[w.id](p) ≠ ⊥ we have that current.id_p ≥ w.id > c.id. Since
y.high > c.id we have that y.tag > buildtag[p, c.id]. By assumption buildtag[p, c.id] > t. Hence we
have that y.tag > t. □

The following invariant is the analog of Invariant 6.13 of [41]. 

Invariant 6.5.7 (ABD-SYS)

In any reachable state, for any p, for any summary y and for all c, w ∈ created we have that: If
state-dlv[p, c.id] ≠ ⊥, c.id < w.id and y ∈ allstate[p, w.id] then y.high ≥ c.id.

Proof Sketch: Assume that p, y, c and w satisfy the hypothesis. Since state-dlv[p, c.id] ≠ ⊥, we
have that process p has executed action newstate(·)_p for configuration c. When executing this action
it sets high_p := c. Hence any summary due to p for a later configuration has a high component
which is at least c.id. This is true also for y, which is a summary due to p for configuration w, since
w.id > c.id. □

Invariant 6.5.8 (ABD-SYS)

In any reachable state, if got-state[g](p) ≠ ⊥ then current.id_p ≥ g.

Proof Sketch: Since got-state[g](p) ≠ ⊥ we have that p submitted its state to DC. In order for
process p to submit its state for configuration g it must be that current_p = c, where c.id = g.
Afterwards, by monotonicity of configuration identifiers, we have that current.id_p ≥ g. □

Next we provide a sequence of invariants which leads to the proof that ABD-SYS implements an 
atomic read/write shared register. 

Invariant 6.5.9 (ABD-SYS)

In any reachable state, the following is true. Let c ∈ created \ dead, W ∈ c.wqrms and let t ∈ T. If
for every r ∈ W such that current.id_r > c.id it holds that estb[c.id]_r = true and buildtag[r, c.id] > t,
then we have that every summary y ∈ allstate with y.high > c.id satisfies y.tag > t.

Proof: By induction on the length of the execution. In the initial state, the only created
configuration is c_0, and there are no summaries y with y.high > g_0. So the invariant is vacuously
true and the base case is proved.

For the inductive step assume that the invariant is true in a state s. We need to prove that
the invariant is true in s' for any step (s, π, s'). To prove that the invariant is true in s' we fix
c ∈ s'.created \ s'.dead, W ∈ c.wqrms, and t ∈ T, and assume that for every r ∈ W, if s'.current.id_r >
c.id then s'.estb[c.id]_r and s'.buildtag[r, c.id] > t. To prove the invariant we need to prove the
following conclusion: for any summary y ∈ s'.allstate such that y.high > c.id, we have that y.tag > t.

Let us first consider the case when c ∉ s.created. Since c ∈ s'.created, action π must be
createconf(c). We consider two subcases.

1. ∃r ∈ c.set : s.current.id_r > c.id.

Fix such a process r. Since c has just been created, r has not attempted c in s so c ∈ s.dead,
which implies c ∈ s'.dead. This contradicts the assumption that c ∈ s'.created \ s'.dead. So
this case is not possible.

2. ∄r ∈ c.set : s.current.id_r > c.id.

In this case, we claim that the invariant is true because there is no y in s'.allstate with
y.high > c.id. By contradiction, fix a y ∈ s'.allstate such that y.high > c.id. By
Invariant 6.5.4 configuration y.high is not dead. Then Invariant 6.5.5 applied to s' implies that
there exists R ∈ c.rqrms such that for all q ∈ R, s'.current.id_q > c.id. Fix some q ∈ R. Since
s.current.id_q = s'.current.id_q, it follows that s.current.id_q > c.id. But q ∈ c.set and thus we
have a contradiction of the defining condition for this case.

Hence in the case when c ∉ s.created the invariant is true. For the rest of the proof we assume
that c ∈ s.created. Since c ∉ s'.dead we have that c ∉ s.dead.

As usual, the interesting steps are those that convert the hypothesis from false to true, and those
that keep the hypothesis true while converting the conclusion from true to false. There are no steps
that convert the hypothesis from false to true. So it remains to consider any steps that keep the
hypothesis true while converting the conclusion from true to false. Thus, we assume that, for every
r ∈ W, if s.current.id_r > c.id then s.estb[c.id]_r = true and s.buildtag[r, c.id] > t. The only steps
that can convert the conclusion from true to false are steps that produce a new summary (because if
a summary y ∈ s.allstate has y.high > c.id, then by the inductive hypothesis we have that y.tag > t
and we are done).

Any step that produces a summary y by modifying an old summary y' ∈ s.allstate, in such a
way that y'.tag < y.tag and y'.high = y.high, is easy to handle: For such a step, y'.high > c.id and
so the inductive hypothesis implies that t < y'.tag < y.tag, as needed. So the only concern is with
a newstate action for some configuration w.

Hence we assume that π = newstate(⟨x, t, h⟩)_p for some process p such that s.current_p = w.
Action π produces the following new summary y = (s'.val_p, s'.tag_p, s'.high_p) and since, by the code
of π, s'.high_p = w.id we have y.high = w.id. Assume that y.high > c.id (otherwise we are done). In
order to prove the invariant we have to prove that y.tag > t.

Since y.high > c.id and y.high = w.id, we have that w.id > c.id. We also notice that
configuration w is not dead in s. Indeed by the code of π we have that s'.high_p = s'.current_p and
since s'.current_p = s.current_p = w we have that w = s'.high_p. By Invariant 6.5.4 we have that
w ∉ s'.dead. Clearly w ∉ s.dead.

Let Y = s.got-state[w.id] and let p' = chosenrep(Y). Let y' be the summary y' = s.got-state[w.id](p').
Before proving that y.tag > t we prove two claims that are needed for the proof.

• Claim 1. y'.high ≥ c.id.

Let c' denote the configuration which has the highest identifier in the set of configurations
{c'' | c'' ∈ s'.TotEst, c''.id < w.id}.

Remember that w ∉ dead and that c.id < w.id.

We consider two possible cases:

1. c'.id > c.id

Since c' ∈ s'.TotEst, we have that c' ∉ s'.dead (and thus c' ∉ s.dead). Also w ∉ s'.dead.
By definition of c', we have that there are no totally established configurations in between
c' and w. Then Invariant 6.3.1 shows that there exists R ∈ c'.rqrms such that R ⊆ w.set.
Fix any q ∈ R. Since c' ∈ s'.TotEst we have that s.state-dlv[q, c'.id] ≠ ⊥. Let y'' =
s.got-state[w.id](q). By Invariant 6.5.7, we obtain that y''.high ≥ c'.id. By the definition
of p' as a member that maximizes the high component in the summaries recorded in
got-state, we have y'.high ≥ y''.high. Therefore y'.high ≥ c'.id > c.id, completing our
proof of the claim for this case.

2. c'.id ≤ c.id

By assumption, c ∉ dead. We have observed above that w ∉ dead. By definition of
c', we have that there are no totally established configurations in between c' and w and
since c'.id ≤ c.id < w.id it follows that there are no totally established configurations in
between c and w. Then Invariant 6.3.1 shows that there exists R ∈ c.rqrms such that
R ⊆ w.set. We have that R ∩ W ≠ ∅. Let q be any element of R ∩ W. Since R ⊆ w.set,
we have that q ∈ w.set. Because s.got-state[w.id](q) ≠ ⊥, Invariant 6.5.8 implies that
s.current.id_q ≥ w.id. Since w.id > c.id we have that s.current.id_q > c.id.
Recall that we have assumed that for every r ∈ W, if s.current.id_r > c.id then s.estb[c.id]_r =
true and s.buildtag[r, c.id] ≥ t. Therefore, since q ∈ W and s.current.id_q > c.id, we have
that s.estb[c.id]_q = true and thus s.state-dlv[q, c.id] ≠ ⊥.

Let y'' = s.got-state[w.id](q); thus y'' ∈ s.allstate[q, w.id]. By Invariant 6.5.7, we obtain
that y''.high ≥ c.id. By the definition of p' as a member that maximizes the high
component in the summaries recorded in got-state[w.id], we have y'.high ≥ y''.high. Therefore
y'.high ≥ c.id, completing our proof of the claim for this case.

Thus we have proved that y'.high ≥ c.id.

• Claim 2. If y'.high = c.id, then in s' there is no totally established configuration w' such that
c.id < w'.id < w.id.

To see this, consider again the totally established configuration c' with the largest identifier
less than w.id. By the definition of c' it suffices to prove that c'.id ≤ c.id.
Neither c' nor w is dead. Since there are no totally established configurations in between c'
and w, Invariant 6.3.1 implies that w.set contains a read quorum of c', and thus an element of
c'.set. That is, there exists q ∈ c'.set ∩ w.set. Consider the summary y'' = s.got-state[w.id](q).
By the precondition of π we have y'' ≠ ⊥ and thus we have that y'' ∈ s.allstate[q, w.id]. By
definition of c' as the totally established configuration with the largest identifier less than w.id,
we have that c'.id < w.id and that state-dlv[q, c'.id] ≠ ⊥. Then Invariant 6.5.7 shows that
y''.high ≥ c'.id. The definition of p' as a member that maximizes the high component among
the summaries recorded in s.got-state[w.id] shows that y'.high ≥ y''.high ≥ c'.id. But the
claim is conditional on the hypothesis that y'.high = c.id. So if y'.high = c.id we have that
c.id ≥ c'.id, which gives the claim.

Hence we have proved that if y'.high = c.id, in s' there is no totally established configuration
w' such that c.id < w'.id < w.id.

We are now ready to prove that y.tag ≥ t. By Claim 1, we have that y'.high ≥ c.id.

If y'.high > c.id, by the inductive hypothesis we have that y'.tag ≥ t and since y.tag ≥ y'.tag,
we have that y.tag ≥ t, as needed.

So suppose y'.high = c.id. By Claim 2, in s' there is no totally established configuration w' such
that c.id < w'.id < w.id.

We know that c ∉ s.dead and that w ∉ s.dead. Thus by Invariant 6.3.1 we have that there
exists R ∈ c.rqrms such that R ⊆ w.set; therefore, since R ∩ W ≠ ∅, there exists q ∈ W ∩ w.set.
By the precondition of π, using the fact that q ∈ w.set, we have s.got-state[w.id](q) ≠ ⊥ and thus
s.current.id_q ≥ w.id > c.id. Thus also s'.current.id_q > c.id. We have that q ∈ W and s'.current.id_q > c.id;
for such a process we have that s'.buildtag[q, c.id] ≥ t and s'.estb[c.id]_q = true. Since action π
does not modify these variables we have that s.buildtag[q, c.id] ≥ t and s.estb[c.id]_q = true. The
precondition of π shows that s.got-state[w.id](q) ≠ ⊥. Let summary y'' = s.got-state[w.id](q).
Thus y'' ∈ allstate[q, w.id]. Since s.estb[c.id]_q = true we have that state-dlv[q, c.id] ≠ ⊥. By
Invariant 6.5.7, we have that y''.high ≥ c.id. By Invariant 6.5.6 we have that y''.tag ≥ t. Recall
that y'.high = c.id and by definition y' is a summary with maximal high. Since y''.high ≥ c.id it
must be that y''.high = c.id, and so the summary y'' from q is among those with maximal high in
s.got-state[w.id]. By the definition of p' as a member that maximizes the tag component, we have
that y'.tag ≥ y''.tag, so y'.tag ≥ t. Since by the code y.tag = y'.tag, we have that y.tag ≥ t, as
needed. □
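
The way the proof uses chosenrep can be illustrated with a small Python sketch. It assumes only what the argument above relies on, namely that chosenrep selects a member whose summary has maximal high component and, among those, maximal tag. The Summary type, the integer encoding of tags and identifiers, and the tie-breaking on process names are assumptions made for illustration and are not part of the specification.

```python
from typing import Dict, NamedTuple, Optional


class Summary(NamedTuple):
    """Hypothetical summary <val, tag, high>; tags and ids are plain ints here."""
    val: str
    tag: int
    high: int


def chosenrep(got_state: Dict[str, Optional[Summary]]) -> str:
    """Return a process whose summary maximizes high, then tag.

    Every member must have submitted a summary (no None entries), mirroring
    the precondition of newstate. Ties are broken by process name, which is
    an assumption of this sketch.
    """
    if any(y is None for y in got_state.values()):
        raise ValueError("every member must have submitted a summary")
    return max(got_state, key=lambda p: (got_state[p].high, got_state[p].tag, p))


got = {
    "p1": Summary("a", tag=3, high=1),
    "p2": Summary("b", tag=5, high=2),
    "p3": Summary("c", tag=4, high=2),
}
rep = chosenrep(got)
```

Here p2 and p3 share the maximal high component, and p2 is selected because its tag is larger; this is the selection order that the last step of the proof relies on.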

The next invariant states that when a completed propagate phase performed in a configuration c
has propagated a tag t, all summaries whose high component is greater than c.id have a tag component
which is greater than or equal to t.

Invariant 6.5.10 (ABD-SYS)

In any reachable state, if prop-end[i] = true, prop-tag[i] = t, c = prop-conf[i] and y ∈ allstate is a
summary with y.high > c.id, then y.tag ≥ t.

Proof: Since prop-end[i] = true, by Invariant 6.5.2 we have that c ∈ created \ dead.

Let W = prop-quorum[i] (this is the write quorum used in the propagate phase of operation
i). Since prop-conf[i] = c, for all p ∈ W, we have that estb[c.id]_p = true (because process p is
involved in the propagate phase of operation i, so it must have established c). This implies also that
current.id_p ≥ c.id. Moreover, since prop-tag[i] = t, if a process p ∈ W has current.id_p > c.id, then
by monotonicity of the tag we have that buildtag[p, c.id] ≥ t (because process p is involved in the
propagate phase of operation i and hence knows tag t; when it leaves configuration c, buildtag[p, c.id]
must be at least t).

By Invariant 6.5.9 we have that y.tag ≥ t. □

The next lemma states a property of any execution of abd-sys. Namely, if a completed propagate 
phase propagates a tag t, any subsequent query phase that totally follows the propagate phase (that 
is, begins after the propagate phase has ended), gets a tag which is greater than or equal to t. 

Lemma 6.5.11 (ABD-SYS)

In any execution, if the completed propagate phase of an operation i totally precedes the completed
query phase of an operation j, then query-tag[j] ≥ prop-tag[i] in any state where both are not ⊥.

Proof: Let s be any state where query-tag[j] and prop-tag[i] are not ⊥, that is, both the propagate
phase of operation i and the query phase of operation j have been completed. Then we have
s.prop-end[i] = true and s.query-end[j] = true. Clearly we also have s.query-begin[j] = true. Let
(s', π, s_1) be the step when prop-end[i] is set to true, let (s'', π, s_2) be the step when query-begin[j]
is set to true and let (s''', π, s_3) be the step when query-end[j] is set to true. We must have that
s_1 precedes s'', s_2 precedes s''' and s_3 precedes s in the execution.

Let t = s.prop-tag[i]. We need to prove that s.query-tag[j] ≥ t.

Let W = s.prop-quorum[i] (this is the write quorum used by the propagate phase of operation
i) and let R = s.query-quorum[j] (this is the read quorum used by the query phase of operation j).

Let c_1 = s.prop-conf[i] and c_2 = s.query-conf[j]. We have W ∈ c_1.wqrms and R ∈ c_2.rqrms.

Since s.prop-end[i] = true and s.query-end[j] = true, by Invariant 6.5.2 we have that c_1, c_2 ∈
created \ dead in state s.

We consider three cases:

1. c_1.id = c_2.id

Since c_1 = c_2 we have that R ∩ W ≠ ∅. Let q ∈ R ∩ W. Process q submits to the condenser
function φ_maxtag of operation j a tag which is greater than or equal to t. By definition of
φ_maxtag we have that s.query-tag[j] ≥ t, which gives the claim.

2. c_1.id < c_2.id

Let p be any process of R. By the code we have that variable high_p is changed only when
action newstate_p is executed, and is set to current.id_p. Since p participates in the query phase
of operation j there must be a state ŝ in between s'' and s_3 such that ŝ.high.id_p =
ŝ.current.id_p = c_2.id. By monotonicity of configuration identifiers we have that s.high.id_p ≥
ŝ.high.id_p and since c_1.id < c_2.id we have that s.high.id_p > c_1.id.

Now, let y_p be the summary due to the acknowledgment value sent by p for the query phase
of operation j. Notice that the tag y_p.tag is used by the condenser function φ_maxtag for the
query phase of operation j.

By Invariant 6.5.10, applied to state s using c = c_1, we have that y_p.tag ≥ t. By definition of
φ_maxtag we have that query-tag[j] ≥ t.

3. c_1.id > c_2.id

We show that this cannot happen. There are two possible cases:

(a) ∄x ∈ s''.TotEst such that c_2.id < x.id < c_1.id.

By Invariant 6.3.1 applied to state s'', there exists W' ∈ c_2.wqrms such that W' ⊆ c_1.set.
We have that R ∩ W' ≠ ∅. Let p ∈ R ∩ W'. We have that p ∈ c_1.set.
In state s_1 the propagate phase of operation i ends. It must be the case that every
member of c_1 has s_1.current.id ≥ c_1.id and also that every member of c_1 has submitted
its state for c_1 prior to the beginning of the query phase of operation j. Since p ∈ c_1.set,
there must exist a state ŝ preceding s_1 such that ŝ.current_p = c_1.

Since p ∈ R there must exist a state ŝ' in between s'' and s_3 such that ŝ'.current_p = c_2.
Since s_1 precedes s'', we have that ŝ precedes ŝ'. By monotonicity of configuration
identifiers we must have ŝ.current.id_p ≤ ŝ'.current.id_p, that is c_1.id ≤ c_2.id. This contradicts
the hypothesis that c_1.id > c_2.id.

(b) ∃x ∈ s''.TotEst such that c_2.id < x.id < c_1.id.

Let c' be the totally established configuration with the smallest identifier intervening
between c_2 and c_1 in s''. By definition of c' we have that c'.id > c_2.id.
By Invariant 6.3.1 applied to state s'', there exists W' ∈ c_2.wqrms such that W' ⊆ c'.set.
We have that R ∩ W' ≠ ∅. Let p ∈ R ∩ W'. Since c' ∈ s''.TotEst and p ∈ c'.set we have
that s''.current.id_p ≥ c'.id.
Since p ∈ R, there must exist a state ŝ between s'' and s_3 such that ŝ.current_p = c_2.
By monotonicity of configuration identifiers, since s'' precedes ŝ we have that ŝ.current.id_p ≥
s''.current.id_p. This implies that c_2.id ≥ c'.id > c_2.id, which is impossible.

□
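
Case 1 of the proof can be illustrated with a small Python sketch. It assumes that the condenser φ_maxtag returns the largest tag among the acknowledgments it receives and that tags are pairs (sequence number, process identifier) compared lexicographically; this concrete representation and the names used below are assumptions for illustration only.

```python
from typing import Dict, Tuple

Tag = Tuple[int, str]  # (sequence number, process id); representation is assumed


def phi_maxtag(acks: Dict[str, Tag]) -> Tag:
    """Condenser sketch: return the largest tag among the acknowledgments."""
    return max(acks.values())


t: Tag = (4, "p1")                  # tag propagated by operation i
write_quorum = {"p1", "p2"}         # W, used by the propagate phase
read_quorum = {"p2", "p3"}          # R, used by the query phase

# Tags reported by the members of R when they acknowledge the query.
# p2 is in W, so it already holds a tag that is at least t.
acks = {"p2": (4, "p1"), "p3": (2, "p3")}

common = write_quorum & read_quorum
query_tag = phi_maxtag({p: acks[p] for p in read_quorum})
```

Because R and W share the process p2, the maximum taken over R cannot fall below t, which is the content of Case 1.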

In order to prove that the system implements an atomic object we use the following lemma from
[65] (Lemma 13.16, page 435).

Lemma 6.5.12 Let β be a (finite or infinite) sequence of actions of a read/write atomic object
external interface. Suppose that β is well-formed, and contains no incomplete operations. Let Π be
the set of all operations in β.

Suppose that ≺ is an irreflexive partial ordering of all the operations in Π, satisfying the following
properties:

1. For any operation i ∈ Π, there are only finitely many operations j such that j ≺ i.

2. If the response event for operation i precedes the invocation event for operation j in β, then it
cannot be the case that j ≺ i.

3. If i is a write operation in Π and j is any operation in Π, then either i ≺ j or j ≺ i.

4. The value returned by each read operation is the value written by the last preceding write
operation according to ≺ (or a fixed initial value, if there is no such write).

Then β satisfies the atomicity property.

We can use the above lemma to prove the following result. By Lemma 13.10 of [65] (page 419)
we can restrict our attention to executions with no incomplete operations.

Theorem 6.5.13 ABD-SYS implements an atomic read/write object. 

Proof: In order to show that the system implements an atomic object we need to provide a partial
order that satisfies Lemma 6.5.12. Let us define the order ≺ as follows. First define the tag of an
operation i as tag(i) = prop-tag[i], that is, the tag written in the propagate phase. Order all write
operations in order of tag and place each read operation after the write operation with the same tag
and before any other write operation (the order of read operations in between two consecutive write
operations is irrelevant). Place all reads for which there is no write operation with the same tag
before the first write operation.

Next we prove that ≺ satisfies Lemma 6.5.12. Let us start with Point 1. Any operation j ≺ i must
have tag(j) ≤ tag(i). The number of write operations that precede i is bounded by the number of
tags which are strictly smaller than tag(i). This is a finite number. The number of reads which have
a tag smaller than tag(i) is bounded by the number of read operations completed before operation
i is completed. This is also a finite number.

Now consider Point 2. Assume that the response event for an operation i precedes the invocation
event for an operation j. Then we have that the propagate phase of operation i precedes the query
phase of operation j and by Lemma 6.5.11 we have that the tag returned by the query phase
of operation j is greater than or equal to the tag written by the propagate phase of operation i.
Since the latter is equal to tag(i) and since the former is less than or equal to tag(j) we have that
tag(i) ≤ tag(j). Thus it cannot be that j ≺ i because this means that tag(j) < tag(i).

Consider now Point 3. Assume that i is a write operation and that j is any other operation.
Assume by contradiction that neither i ≺ j nor j ≺ i. Then we have that tag(i) = tag(j) and
that j is also a write operation (a read operation with the same tag as i is such that i ≺ j). Since
tag(i) = tag(j) and since the process identifier is part of the tag, it must be the case that both
operations are requested by the same process. Hence it must be the case that one of the operations,
say i, is completed before the other, operation j, is requested. Thus the (completed) propagate
phase of operation i precedes the query phase of operation j. Hence by Lemma 6.5.11 we have
that the tag returned by the query phase of operation j is greater than or equal to the tag written
by the propagate phase of operation i. The latter is equal to tag(i) and the former, by the code,
is strictly less than tag(j). Hence we have that tag(i) < tag(j), which contradicts the fact that
tag(i) = tag(j).

Finally consider Point 4. Since each read is ordered right after the write with the same tag it
is enough to show that a read operation i gets the value written by a write operation j such that
tag(j) = tag(i). So let i be a read operation and let j be a write operation with tag(j) = tag(i). It
follows by the code that the value returned by operation i is the one written by operation j (because
tag and value are updated simultaneously). □
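
The construction of the order used in the proof can be sketched in Python. The sketch assumes tags are pairs (sequence number, process identifier), that reads without a matching write carry a tag smaller than every write tag, and it picks one total order compatible with ≺ by sorting; the Op type and the function names are hypothetical and serve only as an illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

Tag = Tuple[int, str]  # assumed representation: (sequence number, process id)


class Op(NamedTuple):
    name: str
    kind: str               # "read" or "write"
    tag: Tag                # tag(i) = prop-tag[i]
    value: Optional[str]    # value written, or value returned by a read


def linearize(ops: List[Op]) -> List[Op]:
    """Sort by tag, placing each write before the reads that carry its tag."""
    return sorted(ops, key=lambda o: (o.tag, 0 if o.kind == "write" else 1))


def reads_see_last_write(order: List[Op], initial: Optional[str]) -> bool:
    """Check Point 4: each read returns the value of the last preceding write."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


INITIAL_TAG: Tag = (0, "")
ops = [
    Op("r1", "read", (1, "p"), "x"),
    Op("w2", "write", (2, "q"), "y"),
    Op("r0", "read", INITIAL_TAG, None),
    Op("w1", "write", (1, "p"), "x"),
    Op("r2", "read", (2, "q"), "y"),
]
order = linearize(ops)
names = [op.name for op in order]
```

In this example the read r0 carries the initial tag and is placed before the first write, while r1 and r2 follow the writes whose tags they carry.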

6.6 Remarks 

We remark that the intersection property of DC, namely that there exist a read quorum R and a write
quorum W of a previous primary configuration both belonging to the next primary configuration,
comes from the particular application that we have developed. For other applications one might
have different (maybe weaker) intersection properties. For example, one might require that the
new primary configuration contains a read quorum of the previous one (and not necessarily a write
quorum). In our case, we must require both a read quorum and a write quorum in the new primary.
If we do not require a read quorum to be in the new configuration but only require a write quorum 
to be in the new configuration, since write quorums may not intersect, two non-intersecting write 
quorums might concurrently proceed to two primary configurations, violating the uniqueness of a 
primary configuration. The same situation can happen if we do not require a write quorum to 
be in the new configuration but only a read quorum to be in the new configuration, because two 
read quorums may not intersect. In this latter case it is also possible for a read quorum in an old 
configuration to read obsolete values; indeed processes in a read quorum can be left behind if newer 
configurations are established but since they form a read quorum of the configuration they are in, 
they will be able to read whatever (obsolete) value they have. 
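
The scenario described above can be checked on a small example. The following Python sketch assumes a hypothetical representation of read-write quorum configurations (RWConfig) and compares the requirement used by DC with the weaker requirement of a write quorum only; the names and the four-process example are assumptions for illustration.

```python
from typing import FrozenSet, List, NamedTuple


class RWConfig(NamedTuple):
    """Hypothetical read-write quorum configuration."""
    id: int
    members: FrozenSet[str]
    rqrms: List[FrozenSet[str]]
    wqrms: List[FrozenSet[str]]


def has_write_quorum(prev: RWConfig, new_members: FrozenSet[str]) -> bool:
    return any(w <= new_members for w in prev.wqrms)


def has_read_and_write_quorum(prev: RWConfig, new_members: FrozenSet[str]) -> bool:
    has_read = any(r <= new_members for r in prev.rqrms)
    return has_read and has_write_quorum(prev, new_members)


fs = frozenset
prev = RWConfig(
    id=1,
    members=fs("abcd"),
    # Every read quorum intersects every write quorum,
    # but the two write quorums are disjoint.
    rqrms=[fs("ac"), fs("ad"), fs("bc"), fs("bd")],
    wqrms=[fs("ab"), fs("cd")],
)
left, right = fs("ab"), fs("cd")      # two disjoint candidate successors
merged = fs("abc")                    # contains write quorum ab and read quorum ac
```

With the weaker check both disjoint candidates would be accepted as successors of the same configuration, whereas requiring a read quorum as well rejects both and accepts only a candidate such as merged.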

It is possible to optimize the state transfer at the beginning of a new configuration. The goal of
the state transfer is to obtain all information from previous configurations. Clearly processes that join
the system have no information about previous configurations. Hence it is useless to wait for them
to submit their state before computing the new up-to-date state.

We remark that the choice of integrating the state transfer into the service has been made because 
most applications have to perform such state exchange and thus it seems reasonable to do it within 
the service in order to free the application from the details of such a computation. We did not 
change the dvs service to also offer integrated state exchange because some applications may not 
require submission of the current state from every member of a new view or configuration. So it 
may be useful also to leave to the application control of the state exchange computation. 

The above remark is connected to the question: does DC supersede DVS? On one hand DC is
more general than DVS because it provides a group communication service that handles configurations
instead of views and configurations carry more information than views. On the other hand there are
some differences between DC and DVS. We already talked about the difference in the state exchange
mechanism. Another difference is in the communication mechanism used by the two services: DVS
uses a point-to-point communication mechanism, while DC uses a broadcast/convergecast mechanism
involving a quorum of processes. Because of these differences we have kept the two services as
different services.

The DC service requires every process of a new configuration to submit its state. This is a strong 
requirement for applications that use quorums to improve availability. However it provides a strong 
service. It is possible to specify a weaker version of the DC service that requires only a read quorum 
to submit the state before computing the starting state of a new configuration. We believe that the 
TO algorithm we have developed in this chapter would still work with this weaker service. 

The ABD algorithm does not use the prefix property of message delivery guaranteed by the CS 
specification. Hence one could use a weaker specification instead of CS to implement DC. 

The implementation of DC performs garbage collection when a view becomes totally established 
(any previous view is discarded and no intersection checks are made with these discarded views). 
Yeger Lotem et al. [89] perform garbage collection when a view becomes established (the process 
that establishes a view discards all previous ambiguous views). Our garbage collection mechanism, 
though less efficient than that of [89], allows an easier proof of correctness.

Chapter 7

Dynamic Algorithms

In this chapter we apply the ideas about dynamic configurations developed in Chapter 6 to design 
a dynamic version of the PAXOS algorithm [61], called DPAXOS, and a dynamic primary copy data 
replication algorithm implementing an atomic object, called RAB. 

Both algorithms are built upon an underlying group communication service; this service, called 
dlc, is a variation of the DC service (see Chapter 6) which uses "leader configurations" (see Chap- 
ter 3) and augments the service with point-to-point communication. 

We start the chapter with the dlc service. Section 7.2 provides the dpaxos algorithm. In 
Section 7.3 we sketch the rab algorithm. Finally Section 7.4 contains concluding remarks. 

7.1 The dlc specification 

In this section we provide a dynamic primary configuration group communication service. This 
service, which we call DLC, is similar to the DC service; the differences are: (i) DLC handles leader 
configurations instead of read-write quorum configurations (see Chapter 3) and (ii) DLC provides 
point-to-point channels too. 

The code of DLC is provided in Figure 7-1. In Section 7.1.1 we explain the differences between
DLC and DC. We also provide a full description of DLC in Section 7.1.2; however, the reader who
comes from Chapter 6 and reads Section 7.1.1 can safely skip Section 7.1.2.

7.1.1 Differences with DC 

There are basically two differences between DC and DLC. The first difference is due to the kind of
configurations considered: leader configurations instead of read-write quorum configurations. As
a result, the key intersection property becomes the following: for any two created and non-dead
configurations c_1 and c_2, with c_1.id < c_2.id, either there exists an intervening totally established
configuration or a quorum of c_1 is included in the membership set of c_2. Invariant 7.1.1 formalizes
the above key property (this invariant is given in Section 7.1.3).

DLC

Signature:
  Input:    submit(m, φ, i)_p,      m ∈ M, φ ∈ Φ, p ∈ 𝒫, i ∈ OID_p
            ackdlvr(a, i)_p,        a ∈ A, i ∈ OID, p ∈ 𝒫
            submit-state(s, φ)_p,   s ∈ S, φ ∈ Φ, p ∈ 𝒫
            p2p-send(m)_{p,q},      m ∈ M, p, q ∈ 𝒫
  Output:   newconf(c)_p,           c ∈ C, p ∈ c.set
            newstate(s)_p,          s ∈ S, p ∈ 𝒫
            respond(a, i)_p,        a ∈ A, i ∈ OID_p, p ∈ 𝒫
            deliver(m, i)_p,        m ∈ M, i ∈ OID, p ∈ 𝒫
            p2p-recv(m)_{p,q},      m ∈ M, p, q ∈ 𝒫
  Internal: createconf(c),          c ∈ C

State:
  created ∈ 2^C, init {c_0}
  for each p ∈ 𝒫:
    cur-cid[p] ∈ G_⊥, init g_0 if p ∈ P_0, ⊥ else
  for each g ∈ G:
    got-state[g] : 𝒫 → S_⊥, init everywhere ⊥
    condenser[g] : 𝒫 → Φ_⊥, init everywhere ⊥
    state-dlv[g] ∈ 2^𝒫, init P_0 if g = g_0, {} else
    pending[g] ∈ O, init everywhere ⊥
    attempted[g] ∈ 2^𝒫, init P_0 if g = g_0, {} else
  for each p, q ∈ 𝒫, g ∈ G:
    p2p-msgs[p, g, q] ∈ seqof(M), initially empty
    p2p-next[p, g, q] ∈ N^{>0}, initially 1

Derived variables:
  Att ∈ 2^C, defined as {c ∈ created | attempted[c.id] ≠ {}}
  TotAtt ∈ 2^C, defined as {c ∈ created | c.set ⊆ attempted[c.id]}
  Est ∈ 2^C, defined as {c ∈ created | state-dlv[c.id] ≠ {}}
  TotEst ∈ 2^C, defined as {c ∈ created | c.set ⊆ state-dlv[c.id]}
  dead ∈ 2^C, defined as {c ∈ C | ∃p ∈ c.set : cur-cid[p] > c.id and p ∉ attempted[c.id]}

Actions:

  internal createconf(c)
    Pre: ∀w ∈ created : c.id ≠ w.id
         if c ∉ dead then
           ∀w ∈ created, w.id < c.id:
             w ∈ dead or
             (∃x ∈ TotEst: w.id < x.id < c.id) or
             (∃Q ∈ w.qrms: Q ⊆ c.set)
           ∀w ∈ created, w.id > c.id:
             w ∈ dead or
             (∃x ∈ TotEst: c.id < x.id < w.id) or
             (∃Q ∈ c.qrms: Q ⊆ w.set)
    Eff: created := created ∪ {c}

  output newconf(c)_p, p ∈ c.set
    Pre: c ∈ created
         c.id > cur-cid[p]
    Eff: cur-cid[p] := c.id
         attempted[c.id] := attempted[c.id] ∪ {p}

  input submit-state(s, φ)_p
    Eff: if cur-cid[p] ≠ ⊥ and got-state[cur-cid[p]](p) = ⊥ then
           got-state[cur-cid[p]](p) := s
           condenser[cur-cid[p]](p) := φ

  output newstate(s)_p  choose c
    Pre: c.id = cur-cid[p]
         c ∈ created
         ∀q ∈ c.set, got-state[c.id](q) ≠ ⊥
         let f = condenser[c.id](p)|c.set
         s = f(got-state[c.id])
         p ∉ state-dlv[c.id]
    Eff: state-dlv[c.id] := state-dlv[c.id] ∪ {p}

  input submit(m, φ, i)_p
    Eff: if cur-cid[p] ≠ ⊥ then
           pending[cur-cid[p]](i) := (m, φ, ∅, λ(x) : x → ⊥, false)

  output deliver(m, i)_p  choose g
    Pre: g = cur-cid[p]
         p ∉ pending[g](i).dlv
         pending[g](i).msg = m
    Eff: pending[g](i).dlv := pending[g](i).dlv ∪ {p}

  input ackdlvr(a, i)_p
    Eff: if cur-cid[p] ≠ ⊥ and pending[cur-cid[p]](i).ack(p) = ⊥ then
           pending[cur-cid[p]](i).ack(p) := a

  output respond(r, i)_p  choose c, Q
    Pre: c.id = cur-cid[p]
         c ∈ created
         i ∈ OID_p
         pending[c.id](i).rsp = false
         Q ∈ c.qrms
         let f = pending[c.id](i).ack
         ∀q ∈ Q : f(q) ≠ ⊥
         r = pending[c.id](i).cnd(f|Q)
    Eff: pending[c.id](i).rsp := true

  input p2p-send(m)_{p,q}
    Eff: if cur-cid[p] ≠ ⊥ then
           append m to p2p-msgs[p, cur-cid[p], q]

  output p2p-recv(m)_{p,q}  choose g
    Pre: g = cur-cid[q]
         m = p2p-msgs[p, g, q](p2p-next[p, g, q])
    Eff: p2p-next[p, g, q] := p2p-next[p, g, q] + 1

Figure 7-1: The DLC specification

The second difference is that DLC also offers point-to-point communication. That is, a process p
can send a message m to another process q, provided that both p and q are in the same
configuration. Actions p2p-send(m)_{p,q} and p2p-recv(m)_{p,q} of DLC implement the point-to-point communication
mechanism (this portion of the code is not present in DC).

The rest of the dlc specification is the same as the DC specification. 

7.1.2 Full description of dlc 

In this section we provide a full description of dlc. The reader who comes from Chapter 6 and 
has read Section 7.1.1 can safely skip this section. The description we provide here is similar to the 
description of the DC service provided in Section 6.2. 

Prior to providing the code for the DLC specification, we need some notation and definitions, which
we introduce in the following.

Let OID be a set of operation identifiers, partitioned into sets OID_p, p ∈ 𝒫. We denote by
M_c ⊆ M the set of messages that clients may use for communication.

Let A be a set of "acknowledgment" values and let R be a set of "response" values. A condenser
function is a function from (𝒫 → A_⊥) to R. Let Φ be the set of all condenser functions. Let S
be the set of all possible states of the clients (a state of S does not need to be the entire client's
state, but it may contain only the relevant information in order for the application to work). The
DLC specification uses a condenser function also to compute the starting state of a new configuration;
hence we assume that S ⊆ A and also S ⊆ R. Given a function f : 𝒫 → D from the set of processes
𝒫 to some domain D and given a subset P ⊆ 𝒫, we write f|P to denote the function f' : P → D,
defined as f'(p) = f(p) for p ∈ P.

The following data type is used to describe operations: 𝒟 = M × Φ × 2^𝒫 × (𝒫 → A_⊥) × Bool
and we let O = OID → 𝒟_⊥. Given an operation descriptor, selectors for the components are msg,
cnd, dlv, ack, and rsp.
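
The restriction f|P and the operation descriptor can be pictured with the following Python sketch. The dictionary encoding of functions on processes (with None playing the role of ⊥), the integer acknowledgment values, and the names restrict and OpDescriptor are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional, Set

Ack = Optional[int]                                # None plays the role of ⊥
Condenser = Callable[[Dict[str, Ack]], int]        # from (processes → A_⊥) to R


def restrict(f: Dict[str, Ack], subset: Iterable[str]) -> Dict[str, Ack]:
    """Sketch of f|P: keep only the entries for processes in the subset."""
    return {p: f.get(p) for p in subset}


@dataclass
class OpDescriptor:
    """Hypothetical encoding of a descriptor with selectors msg, cnd, dlv, ack, rsp."""
    msg: str
    cnd: Condenser
    dlv: Set[str] = field(default_factory=set)           # initially empty
    ack: Dict[str, Ack] = field(default_factory=dict)    # missing key means ⊥
    rsp: bool = False


def max_ack(acks: Dict[str, Ack]) -> int:
    """Example condenser: the largest acknowledgment value that is not ⊥."""
    return max(v for v in acks.values() if v is not None)


d = OpDescriptor(msg="read", cnd=max_ack)
d.ack.update({"p": 3, "q": 7, "r": 9})
restricted = restrict(d.ack, ["p", "q"])
response = d.cnd(restricted)
```

Restricting the acknowledgments to the processes p and q before applying the condenser excludes the value reported by r, so the response here is computed from the chosen subset only.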

Next we provide remarks and an informal description of this code. We start with the derived 
variables. 

A configuration c ∈ Att is said to be attempted. For an attempted configuration c there exists at
least one process p that has executed action newconf(c)_p and thus we have that p ∈ attempted[c.id];
when this holds we say that c is attempted at p or that p has attempted c. A configuration c ∈ TotAtt
is said to be totally attempted. A totally attempted configuration is a configuration that is attempted
at all members of the configuration.

A configuration c ∈ Est is said to be established. For an established configuration c there exists at
least one process p that has executed action newstate(s)_p and thus we have that p ∈ state-dlv[c.id];
when this holds we say that c is established at p or that p has established c. A configuration
c ∈ TotEst is said to be totally established. A totally established configuration is a configuration that
is established at all members of the configuration.

A dead configuration c is a configuration for which a member process p went on to newer
configurations, that is, it executed action newconf(c')_p with c'.id > c.id, before receiving the notification,
that is the newconf(c)_p event, for configuration c.
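
The derived variables can be computed directly from the state variables, as the following Python sketch suggests. The Config type, the dictionary encoding of attempted and cur-cid (with None for ⊥), and the restriction of dead to a given collection of configurations are assumptions made for illustration; Est and TotEst are obtained in the same way from state-dlv.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Optional, Set


class Config(NamedTuple):
    """Hypothetical configuration with an identifier and a membership set."""
    id: int
    members: FrozenSet[str]


def some_member(created: Iterable[Config], table: Dict[int, Set[str]]) -> Set[Config]:
    """Att (or Est): configurations recorded at some process."""
    return {c for c in created if table.get(c.id)}


def all_members(created: Iterable[Config], table: Dict[int, Set[str]]) -> Set[Config]:
    """TotAtt (or TotEst): configurations recorded at every member."""
    return {c for c in created if c.members <= table.get(c.id, set())}


def dead(configs: Iterable[Config], cur_cid: Dict[str, Optional[int]],
         attempted: Dict[int, Set[str]]) -> Set[Config]:
    """Configurations skipped by a member that moved on to a newer one."""
    result = set()
    for c in configs:
        for p in c.members:
            cid = cur_cid.get(p)
            if cid is not None and cid > c.id and p not in attempted.get(c.id, set()):
                result.add(c)
                break
    return result


c1 = Config(1, frozenset({"a", "b"}))
c2 = Config(2, frozenset({"a", "b", "c"}))
attempted = {1: {"a"}, 2: {"a", "b"}}
cur_cid = {"a": 2, "b": 2, "c": None}

att = some_member([c1, c2], attempted)
tot_att = all_members([c1, c2], attempted)
dead_confs = dead([c1, c2], cur_cid, attempted)
```

In this example process b moved to configuration 2 without ever attempting configuration 1, so configuration 1 is dead; configuration 2 is attempted but not totally attempted because c has not received it yet.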

Now we comment on the transitions. 

Action createconf(c) creates a new configuration c. The first precondition simply requires this 
new configuration to have a brand new identifier. The second precondition of this action is the key 
to our specification. It states that when a configuration c is created it must either be already dead or 
for any other configuration w such that there are no intervening totally established configurations, 
the earlier configuration (i.e., the one with smaller identifier) has at least one quorum included in 
the membership set of the later configuration (i.e., the one with bigger identifier). 
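
The precondition of createconf(c) can be sketched as a Python predicate. The LConfig type and the representation of dead and TotEst as sets of identifiers are assumptions for illustration; the predicate follows the informal description above and is not meant as a definitive rendering of the specification.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set, Tuple


class LConfig(NamedTuple):
    """Hypothetical configuration: identifier, membership set, quorums."""
    id: int
    members: FrozenSet[str]
    qrms: Tuple[FrozenSet[str], ...]


def createconf_enabled(c: LConfig, created: Iterable[LConfig],
                       dead_ids: Set[int], tot_est_ids: Set[int]) -> bool:
    """Return True if createconf(c) may be executed in the given state."""
    created = list(created)
    if any(w.id == c.id for w in created):
        return False                      # the identifier must be brand new
    if c.id in dead_ids:
        return True                       # a dead configuration is unconstrained
    for w in created:
        if w.id in dead_ids:
            continue
        earlier, later = (w, c) if w.id < c.id else (c, w)
        if any(earlier.id < x < later.id for x in tot_est_ids):
            continue                      # an intervening totally established one
        if not any(q <= later.members for q in earlier.qrms):
            return False                  # no quorum of the earlier one survives
    return True


fs = frozenset
c0 = LConfig(0, fs("abc"), (fs("ab"), fs("ac"), fs("bc")))
c1 = LConfig(1, fs("abd"), (fs("ab"),))
c2_ok = LConfig(2, fs("abe"), (fs("ab"),))
c2_bad = LConfig(2, fs("cde"), (fs("cd"),))
```

With c0 and c1 created and totally established, c2_ok may be created because the quorum {a, b} of c1 is contained in its membership set, whereas c2_bad may not, unless it is already dead.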

Action newconf(c)_p delivers a created configuration c to the client process p. The precondition
of this action makes sure that configurations are delivered in order of configuration identifiers. We
notice that because of this precondition, when a configuration c is dead because a process q went on
to newer configurations, we have that process q can no longer execute action newconf(c)_q.

Once a configuration c has been delivered to a client process p, the client process p is supposed
to submit its current state s and a condenser function φ, by means of action submit-state(s, φ)_p. Once
all the processes have submitted their current states, the condenser function φ is used to compute
the starting state of configuration c for process p. The code of this action just memorizes the state
s and the condenser function φ for the current configuration of process p.

Action newstate(s)_p computes the starting state for a configuration c. The precondition of this
action requires that all processes q in the membership of configuration c have submitted their state
for configuration c. The starting state s of configuration c for process p is then computed by applying
the condenser function that process p has submitted to the service with action submit-state(s, φ)_p.
Variable state-dlv[c.id] records the fact that p has received the starting state for configuration c.

We remark that for a dead configuration c there is at least one process p that does not execute
action newconf(c)_p and thus does not submit its state for c with action submit-state(s, φ)_p. This implies
that action newstate(s)_q cannot be executed for any process q. This is why such configurations are
called "dead".

The remaining actions are used to handle the requests of clients. We refer to the process of
handling such a request, which involves the participation of a quorum of processes, as an "operation".
To request the execution of an operation a client process p uses action submit(m, φ, i)_p. The parameters
of this action are as follows: m is a message describing the operation that p needs to perform; φ is
a condenser function to be used to compute a response value for p when a quorum of processes have
provided acknowledgment values to p's message m; i is an operation identifier needed to distinguish
operations (every requested operation has a unique operation identifier). We say "operation i" to
indicate the operation requested with action submit(m, φ, i)_p. For configuration c and operation i,
the variable pending[c.id](i) contains an operation descriptor. The code of action submit(m, φ, i)_p sets
this operation descriptor to a default value.

We now provide an explanation for each component of an operation descriptor. Let d be an
operation descriptor for operation i requested by p in configuration c. d.msg is the message that
describes the request of p; such a message will be delivered to all members of the configuration c.
d.cnd is the condenser function that will be used to compute the response for the operation once
a quorum of processes has provided acknowledgment values. d.dlv is the set of processes to which
the message d.msg has been delivered; initially this is set to an empty set by action submit(m, φ, i)_p.
d.ack contains the acknowledgment values received; initially this is a vector of ⊥ values. Finally,
d.rsp is a flag indicating whether or not the client p that requested the operation has received a
response for the operation.

Action deliver(m, i)_p delivers the message m of operation i to process p. The code of this action
updates the operation descriptor d for operation i by adding process p to the set d.dlv.

Processes that receive the message m for an operation i are supposed to provide an acknowledgment
value a with action ackdlvr(a, i)_p. The code of this action records the acknowledgment value a
of process p into the vector d.ack, where d is the operation descriptor for operation i.

Action respond(r, i)_p provides a response r to process p for the operation i previously submitted by
p. The precondition of this action requires that a quorum Q has provided acknowledgment values.
Then the value r is computed by applying the condenser function provided by p at the time of the
submission to the acknowledgment values of processes in Q. At this point the operation has been
serviced and the rsp component is set to true.
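
The life cycle of an operation descriptor can be illustrated with a small Python sketch. This is only an informal model and not part of the DLC specification: the class name OperationDescriptor, the helper functions, the use of a dictionary whose missing keys play the role of ⊥, and the explicit quorums argument are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Set


@dataclass
class OperationDescriptor:
    """Informal model of an operation descriptor d (field names follow the text)."""
    msg: Any                                            # d.msg: message describing the request
    cnd: Callable[[Dict[str, Any]], Any]                # d.cnd: condenser function
    dlv: Set[str] = field(default_factory=set)          # d.dlv: processes that received d.msg
    ack: Dict[str, Any] = field(default_factory=dict)   # d.ack: a missing key stands for "bottom"
    rsp: bool = False                                   # d.rsp: has the client been answered?


def deliver(d: OperationDescriptor, p: str) -> None:
    """Model of deliver(m, i)_p: record that p has received d.msg."""
    d.dlv.add(p)


def ackdlvr(d: OperationDescriptor, p: str, a: Any) -> None:
    """Model of ackdlvr(a, i)_p: record the acknowledgment value of p."""
    d.ack[p] = a


def respond(d: OperationDescriptor, quorums: Iterable[Set[str]]) -> Any:
    """Model of respond(r, i)_p: enabled only when a whole quorum has acknowledged.

    Returns the condensed response, or None if the action is not enabled.
    """
    if d.rsp:
        return None
    for quorum in quorums:
        if all(member in d.ack for member in quorum):
            d.rsp = True
            return d.cnd({member: d.ack[member] for member in quorum})
    return None
```

For instance, with quorums {p1, p2} and {p2, p3} and a condenser that returns the maximum acknowledgment value, no response is produced after p1 alone acknowledges, while a response is produced as soon as p2 acknowledges as well.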

The code that handles point-to-point communication is provided by actions p2p-send(m)_{p,q} and
p2p-recv(m)_{p,q}. State variable p2p-msgs[p, g, q] is used to record the messages sent. When action
p2p-send(m)_{p,q} is executed in a configuration whose identifier is g, message m is added to p2p-msgs[p, g, q],
which contains a queue of messages sent by p to q in configuration g. Action p2p-recv(m)_{p,q} delivers
message m to q.

7.1.3 Invariant of DLC 

In this section we provide a key invariant of DLC. The property stated by this invariant is used to
prove the correctness of the applications that we build on top of DLC.

The second precondition of createconf(c) is the key to our specification. It states that when a
configuration c is created it must either be already dead or, for any other configuration w such that
there are no intervening totally established configurations, the earlier configuration (i.e., the one
with smaller identifier) has one quorum whose members are included in the membership set of the
later configuration (i.e., the one with bigger identifier). The above precondition enables us to prove
the following key invariant:

Invariant 7.1.1 In any reachable state of DLC, the following is true. Let c_1, c_2 ∈ created \ dead,
with c_1.id < c_2.id. Then either there exists w ∈ TotEst with c_1.id < w.id < c_2.id, or else there exists a
quorum Q ∈ c_1.qrms such that Q ⊆ c_2.set.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
are no two configurations c_1, c_2 ∈ created such that c_1.id < c_2.id.

For the inductive step assume that the invariant is true in a state s. We need to prove that
the invariant is true in s' for any step (s, π, s'). The only action that we need to worry about is
createconf(c), where c = c_1 or c = c_2, because it creates a new configuration (otherwise the invariant
is true in s' by the inductive hypothesis). So assume that π = createconf(c). The invariant follows
easily from the precondition of π. □
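
To make the statement of Invariant 7.1.1 concrete, the following Python sketch checks it on an explicitly given state. It is an illustration only: the Config class, the function name, and the modeling of configuration identifiers as integers are assumptions made for the example, and the sets created, dead and TotEst are passed in as plain collections.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class Config:
    id: int                              # c.id (modeled as an integer for this sketch)
    members: FrozenSet[str]              # c.set
    quorums: Tuple[FrozenSet[str], ...]  # c.qrms


def invariant_7_1_1(created: Iterable[Config],
                    dead_ids: Set[int],
                    tot_est_ids: Set[int]) -> bool:
    """Check the invariant for every pair c1, c2 in created \\ dead with c1.id < c2.id."""
    live = sorted((c for c in created if c.id not in dead_ids), key=lambda c: c.id)
    for index, c1 in enumerate(live):
        for c2 in live[index + 1:]:
            # Either a totally established configuration lies strictly in between ...
            intervening = any(c1.id < w < c2.id for w in tot_est_ids)
            # ... or some quorum of the earlier configuration is contained in the later one.
            shared_quorum = any(q <= c2.members for q in c1.quorums)
            if not (intervening or shared_quorum):
                return False
    return True
```

A configuration whose membership is disjoint from all earlier ones violates the check unless it is dead or is separated from them by totally established configurations.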

7.2 Dynamic PAXOS algorithm 

In this section we present DPAXOS, an algorithm that solves the consensus problem and that adapts
well to dynamic distributed systems. The DPAXOS algorithm is derived from Lamport's PAXOS
algorithm because it uses the same strategy. We refer the reader to the paper by Lamport [61] for a
complete and colorful presentation of the PAXOS algorithm; the papers [29, 30] provide a less colorful
description of the PAXOS algorithm using the I/O automaton model. For the sake of completeness,
in this section we provide a brief description of the PAXOS algorithm.

We first provide a formal definition of the consensus problem in Section 7.2.1. Then in
Section 7.2.2 we recall the original PAXOS algorithm. Section 7.2.3 provides the DPAXOS algorithm and
Section 7.2.4 the proof of correctness.

7.2.1 Distributed Consensus 

Distributed consensus is a fundamental problem in distributed computation. Roughly speaking,
the problem is that of reaching agreement among the members of a distributed system; such
agreement is often necessary for distributed applications (e.g., data replication, airline reservation
systems, distributed transactions).

Next we give a formal definition of the consensus problem that we consider in this paper. Process
p starts the computation with an input value v_p ∈ X, where X is the set of all possible initial values.
Given a particular execution α, we denote by X_α ⊆ X the set of the initial values of processes in α.
Each process has a state variable called decision which is a write-once variable initially not
written. Processes have to write their decision variable in such a way that three conditions are
satisfied:

• Agreement: All the written decision variables are set to the same value. 

• Validity: Any written decision variable is set to a value in X_α.

Since the termination condition is a liveness issue and we do not address liveness, we do not provide
a formal definition. Informally, the termination condition requires that if in a given state s a
configuration c is totally established, c is the configuration with the biggest identifier in s.created,
no other configurations are created after s, all processes of c are alive in state s, and no failures
happen after s, then all processes that are members of c eventually write the decision variable,
provided that the execution is a fair execution.
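
The two safety conditions, agreement and validity, can be phrased as simple predicates over the decision variables. The Python sketch below is only an illustration: the function names are ours, and an unwritten decision variable is modeled as None (so None is assumed not to be a legal initial value).

```python
from typing import Dict, Hashable, Optional


def agreement(decisions: Dict[str, Optional[Hashable]]) -> bool:
    """All written decision variables hold the same value."""
    written = {v for v in decisions.values() if v is not None}
    return len(written) <= 1


def validity(decisions: Dict[str, Optional[Hashable]],
             initial_values: Dict[str, Hashable]) -> bool:
    """Every written decision variable holds the initial value of some process."""
    allowed = set(initial_values.values())
    return all(v in allowed for v in decisions.values() if v is not None)
```

Both predicates hold trivially in a state where no process has written its decision variable, which reflects the fact that they are safety conditions and say nothing about termination.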

7.2.2 The original PAXOS algorithm 

The PAXOS algorithm does not use a group communication service, so there are no views or
configurations; it relies on an external leader elector module which provides (unreliable) information about
the current status of the underlying distributed system, i.e., it tells each process the current membership
and the leader. Termination is guaranteed only when there are no failures and the external
leader elector provides reliable information on the system for a sufficiently long time.

The basic idea is to have processes propose values until one of them is accepted by a majority of 
the processes; that value is the final output value. Any process may propose a value by initiating a 
round for that value. The process initiating a round is said to be the leader of that round while all 
processes, including the leader itself, are said to be agents for that round. 

Since different rounds may be carried out concurrently (the leader elector is not reliable, hence
several processes may concurrently consider themselves leaders), we need to distinguish them. Every
round has a unique identifier. Next we formally define these round identifiers. A round number is a
pair (x, i) where x is a nonnegative integer and i is a process identifier. The set of round numbers
is denoted by R. A total order on elements of R is defined by (x, i) < (y, j) iff x < y or, x = y and
i < j.

We say that round r precedes round r' if r < r'. If round r precedes round r' then we also say
that r is a previous round, with respect to round r'. We remark that the ordering of rounds is not
related to the actual time the rounds are conducted. It is possible that a round r' is started at some
point in time and a previous round r, that is, one with r < r', is started later on.
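
The total order on round numbers is the usual lexicographic order on pairs. As a small illustration (the class name RoundNumber and the use of strings for process identifiers are assumptions of this sketch), Python tuples already compare in exactly this way:

```python
from typing import NamedTuple


class RoundNumber(NamedTuple):
    """A round number (x, i): tuples compare lexicographically, first on x, then on i."""
    x: int   # nonnegative integer
    i: str   # process identifier (assumed to come from a totally ordered set)


def precedes(r: RoundNumber, r_prime: RoundNumber) -> bool:
    """Round r precedes round r' iff r < r' in the lexicographic order."""
    return r < r_prime
```

Note that precedes depends only on the round numbers and not on when the rounds are started, in line with the remark above.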

Every round in the algorithm is tagged with a unique round number. Every message sent by the
leader or by an agent for a round (with round number) r ∈ R carries the round number r so that
no confusion among messages belonging to different rounds is possible.
Informally, the steps for a round are the following.

1. To initiate a round, the leader sends a "Collect" message to all agents announcing that it 
wants to start a new round with round number r and at the same time asking for information 
about previous rounds in which agents may have been involved. 

2. An agent that receives a message sent in step 1 from the leader of the round, responds with 
a "Last" message giving its own information about rounds previously conducted, namely the 
last round r' for which the agent made a commitment and the value v of that round. With 
this, the agent makes a kind of commitment for this particular round that may prevent it from 
accepting (in step 4) the value proposed in some other round. If the agent is already committed 
for a round with a bigger round number then it informs the leader of its commitment with an 
"OldRound" message. 

3. Once the leader has gathered information about previous rounds from a majority of agents, it 
decides, according to some rules, the value to propose for its round and sends to all agents a 
"Begin" message announcing the value v for round r and asking them to accept it. In order 
for the leader to be able to choose a value for the round it is necessary that initial values 
be provided. If no initial value is provided, the leader must wait for an initial value before 
proceeding with step 3. The set of processes from which the leader gathers information is 
called the info-quorum of the round. 

4. An agent that receives a message from the leader of the round sent in step 3, responds with 
an "Accept" message by accepting the value proposed in the current round r, unless it is 
committed for a later round and thus must reject the value proposed in the current round. In 
the latter case the agent sends an "OldRound" message to the leader indicating the round r' 
for which it is committed. 

5. If the leader gets "Accept" messages from a majority of agents, then the leader sets its own
output value to the value proposed in the round. At this point the round is successful. The
set of agents that accept the value proposed by the leader is called the accepting-quorum.
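
The five steps can be sketched as a small, failure-free, synchronous simulation in Python. This is an informal illustration and not the I/O automaton code of [29, 30]: the names Agent and run_round are ours, round numbers are modeled as pairs of integers, messages are replaced by direct method calls, the leader contacts every agent, and the initial commitment (0, 0) is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

Round = Tuple[int, int]  # (x, i), compared lexicographically


@dataclass
class Agent:
    commit: Round = (0, 0)              # biggest round this agent is committed to
    last_round: Optional[Round] = None  # last round in which it accepted a value
    last_value: Any = None              # value accepted in last_round

    def on_collect(self, r: Round):
        """Step 2: answer "Last", or "OldRound" if committed to a bigger round."""
        if r < self.commit:
            return ("OldRound", self.commit)
        self.commit = r
        return ("Last", self.last_round, self.last_value)

    def on_begin(self, r: Round, v: Any):
        """Step 4: accept the value unless committed to a later round."""
        if r < self.commit:
            return ("OldRound", self.commit)
        self.commit = r
        self.last_round, self.last_value = r, v
        return ("Accept", r)


def run_round(r: Round, initial_value: Any, agents: List[Agent]) -> Optional[Any]:
    """Run one round led with round number r; return the decided value or None."""
    majority = len(agents) // 2 + 1
    # Steps 1 and 2: send "Collect" and gather the "Last" replies (the info-quorum).
    lasts = [reply for reply in (a.on_collect(r) for a in agents) if reply[0] == "Last"]
    if len(lasts) < majority:
        return None
    # Step 3: propose the value of the latest previously accepted round, if any.
    accepted = [(reply[1], reply[2]) for reply in lasts if reply[1] is not None]
    value = max(accepted, key=lambda pair: pair[0])[1] if accepted else initial_value
    # Steps 3 to 5: send "Begin" and count the "Accept" replies (the accepting-quorum).
    accepts = [reply for reply in (a.on_begin(r, value) for a in agents) if reply[0] == "Accept"]
    return value if len(accepts) >= majority else None
```

In this sketch, once a round with value "a" has been successful, any later round run on the same agents proposes "a" again regardless of its leader's initial value, and a round with a smaller round number is rejected with "OldRound" replies.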

Since a successful round implies that the leader of the round reaches a decision, after a successful
round the leader still needs to do something, namely to broadcast the decision. Thus, once the
leader has made a decision it broadcasts a "Success" message announcing the value for which it has
decided. An agent that receives a "Success" message from the leader makes its decision choosing
the value of the successful round. We also use an "Ack" message sent from the agent to the leader,
so that the leader can make sure that everyone knows the outcome.

The most important issue concerns the values that leaders propose for their rounds. Indeed,
since the value of a successful round is the output value of some processes, we must guarantee
that the values of successful rounds are all equal in order to satisfy the agreement condition of
the consensus problem. This is the tricky part of the algorithm and basically all the difficulties
derive from solving this problem. Consistency is guaranteed by choosing the values of new rounds
using the information about previous rounds from at least a majority of the agents, so that, for
any two rounds, there is at least one process that participated in both rounds.

In more detail, the leader of a round chooses the value for the round in the following way. In
step 1, the leader asks for information and in step 2 an agent responds with the number of the latest
round in which it accepted the value and with the accepted value, or with round number (0, j) and
⊥ if the agent has not yet accepted a value. Once the leader gets such information from a majority
of the agents (which is the info-quorum of the round), it chooses the value for its round to be equal
to the value of the latest round among all those it has heard from the agents in the info-quorum, or
equal to its initial value if all agents in the info-quorum were not involved in any previous round.
Moreover, in order to keep consistency, if an agent tells the leader of a round r that the last round in
which it accepted a value is round r', r' < r, then implicitly the agent commits itself not to accept
any value proposed in any other round r'', r' < r'' < r.

Given the above setting, if r' is the round from which the leader of round r gets the value for
its round, then, when a value for round r has been chosen, any round r'', r' < r'' < r, cannot be
successful; indeed at least a majority of the processes are committed for round r, which implies
that at least a majority of the processes are rejecting round r''. This, along with the fact that
info-quorums and accepting-quorums are majorities, implies that if a round r is successful, then
any round with a bigger round number r̂ > r is for the same value. Indeed the information sent
by processes in the info-quorum of round r̂ is used to choose the value for the round, but since
info-quorums and accepting-quorums share at least one process, at least one of the processes in the
info-quorum of round r̂ is also in the accepting-quorum of round r. Indeed, since the round is
successful, the accepting-quorum is a majority. This implies that the value of any round r̂ > r must
be equal to the value of round r, which, in turn, implies agreement.

Instead of majorities for info-quorums and accepting-quorums, any quorum system can be used 
(DPAXOS uses the quorum system of the configuration). Indeed the only property that is required is 
that there be a process in the intersection of any info-quorum with any accepting-quorum. 
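
The intersection property is easy to state as a check. The following Python sketch is illustrative only (the function names majorities and intersecting are ours): it enumerates the majorities of a process set and tests whether every info-quorum meets every accepting-quorum.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def majorities(processes: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets containing more than half of the processes."""
    procs = sorted(processes)
    smallest = len(procs) // 2 + 1
    return [frozenset(subset)
            for size in range(smallest, len(procs) + 1)
            for subset in combinations(procs, size)]


def intersecting(info_quorums: Iterable[FrozenSet[str]],
                 accepting_quorums: Iterable[FrozenSet[str]]) -> bool:
    """True iff every info-quorum shares a process with every accepting-quorum."""
    accepting = list(accepting_quorums)
    return all(i & a for i in info_quorums for a in accepting)
```

Majorities satisfy the property, but so do other choices, for example taking every single process as an info-quorum and the whole process set as the only accepting-quorum.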

To end up with a decision value, rounds must be started until at least one is successful. 

7.2.3 The DPAXOS algorithm 

The DPAXOS algorithm borrows the basic ideas of the PAXOS algorithm, but it is built upon the
DLC group communication service. Thus it exploits the properties guaranteed by such a service. A
round of the DPAXOS algorithm is similar to that of the PAXOS algorithm with the following major
differences: (i) since the DLC service provides a new configuration any time the leader changes,
we have that at most one round is conducted in each configuration (hence we do not distinguish
rounds and configurations); (ii) the first part of a round, whose purpose is to find a value that
the leader proposes in the round, is no longer necessary, because the group communication service
provides, with the starting state of a new configuration, a value that can be safely proposed by the
leader. Thus in DPAXOS the leader needs only to send "Begin" messages and collect the "Accept"
messages.

Because of (i) we no longer need to worry about processes committing to reject rounds or to
make sure that messages belonging to different rounds do not interfere: by the properties of the DLC
group communication service we have that processes receive and send messages only in the current
configuration. Since in each configuration only one round is conducted, no interference is possible
and older rounds are automatically rejected. Configuration identifiers can be used as round numbers
(and vice versa). Because of (ii) a round of DPAXOS is shorter than a round of PAXOS (the first part
of the round is basically done while establishing the new configuration).

As a result of the above, the code of DPAXOS is simpler and much shorter than the code of PAXOS
as implemented in [29, 30].

Since in each configuration only one round is run, round numbers in DPAXOS are configuration
identifiers. Hence the set of round numbers is R = G. We say that round r precedes round r' if
r < r'. If round r precedes round r' then we also say that r is a previous round, with respect to
round r'.

For the DPAXOS algorithm, the set S consists of pairs (r, v), where r ∈ G and v ∈ X.

The code, DLC-TO-PAXOS_p, is shown in Figure 7-2. The overall system DPAXOS consists of the
composition of DLC and automaton DLC-TO-PAXOS_p for each p ∈ P.

Next we provide an informal description of the code. 

We start by describing the state variables. Variable current_p contains the current configuration
for process p; if process p runs a round, then the round number is given by current_p.id. Variable
rnd-val_p contains the value that process p proposes in the current round. Variable decision_p contains
the decision of process p. Variable used-ids_p ⊆ OID_p is a set of identifiers used to distinguish
operations that process p submits to the DLC service. Variable last-rnd_p contains the last round for
which process p has accepted the value. Variable last-val_p contains the value of round last-rnd_p.

The DLC-TO-PAXOS_p automaton has a reconfiguration phase, which starts when process p is notified of a new
configuration and ends when process p is given the starting state for the new configuration. When
not in the reconfiguration phase, process p performs normal processing. Variable status_p is used to switch
from the reconfiguration phase to normal processing. For normal processing we have status_p = normal.

Variable mode_p is used by the leader to go through the steps of a round.

DLC-TO-PAXOS_p

Signature:

    Input:    newconf(c)_p, c ∈ C, p ∈ c.set
              newstate(s)_p, s ∈ S, p ∈ P
              deliver(m, i)_p, m ∈ M, i ∈ OID, p ∈ P
              respond(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
              p2p-recv(m)_{q,p}, m ∈ X, q, p ∈ P

    Internal: newround_p, p ∈ P

    Output:   submit(m, φ, i)_p, m ∈ M, φ ∈ Φ, p ∈ P, i ∈ OID_p
              ackdlvr(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
              submit-state(s, φ)_p, s ∈ S, φ ∈ Φ, p ∈ P
              p2p-send(m)_{p,q}, m ∈ X, q, p ∈ P

State:

    current ∈ C_⊥, init c_0 if p ∈ P_0, else ⊥
    rnd-val ∈ X, initially v_p
    decision ∈ X ∪ ⊥, initially ⊥
    used-ids ⊆ OID, initially ∅
    ack-q, seqof(A × OID), init λ
    ack-d, either "ack" or ⊥, init ⊥
    last-rnd ∈ G, initially g_0 if p ∈ P_0, else ⊥
    last-val ∈ X, initially v_p
    status ∈ {normal, exch-ready, exch-wait}, init normal
    mode ∈ {wait, newround, collect, begincast, decided},
        init newround for p = c_0.ldr, wait else
    ack-success(q), a boolean, initially false for all q ∈ P

Actions:

    input newconf(c)_p
        Eff: current := c
             status := exch-ready
             ack-q := λ
             if p = c.ldr and mode ≠ decided then
                 mode := newround

    output submit-state(⟨r, v⟩, φ_state)_p
        Pre: status = exch-ready
             r = last-rnd
             v = last-val
        Eff: status := exch-wait

    input newstate(⟨r, v⟩)_p
        Eff: status := normal
             last-rnd := current.id
             last-val := v
             rnd-value := v
             Hfrom(current.id) := r
             Hvalue(current.id) := v

    internal newround_p
        Pre: mode = newround
             status = normal
        Eff: mode := begincast

    output submit(⟨"begin", v⟩, φ_begin, i)_p
        Pre: current ≠ ⊥
             i ∉ used-ids
             status = normal
             v = rnd-value
             mode = begincast
        Eff: used-ids := used-ids ∪ {i}
             mode := wait

    input deliver(⟨"begin", v⟩, i)_p
        Eff: append (⟨"accept"⟩, i) to ack-q
             last-rnd := current.id
             last-val := v

    output ackdlvr(a, i)_p
        Pre: head(ack-q) = (a, i)
        Eff: ack-q := tail(ack-q)

    input respond(⟨"begin", Q⟩, i)_p
        Eff: decision := rnd-value
             mode := decided
             Haccquo(current.id) := Q

    output p2p-send(v)_{p,q}
        Pre: mode = decided
             v = decision
             ack-success(q) = false
        Eff: none

    input p2p-recv(v)_{q,p}
        Eff: decision := v
             mode := decided
             ack-d := "ack"

    output p2p-send("ack")_{p,q}
        Pre: ack-d = "ack"
        Eff: ack-d := ⊥

    input p2p-recv("ack")_{q,p}
        Eff: ack-success(q) := true

Figure 7-2: The DLC-TO-PAXOS_p code.

Variable ack-q_p is a queue of acknowledgment values to be sent to the DLC service. Variable
ack-d_p is used to send back to the leader an acknowledgment for the decision. Variable ack-success_p
is used by the leader to record those processes that have sent an acknowledgment for the decision.

Next we describe the transitions. We start with the transitions for the reconfiguration phase. 

Action newconf(c)_p notifies process p of a new configuration c. Process p enters the reconfiguration
phase by changing its status to exch-ready. It also resets queue ack-q in order to stop sending
acknowledgments for older rounds. Moreover, if process p is the leader of the new configuration and
it has not yet reached a decision, then it prepares itself, by setting mode_p to newround, to start a
new round when normal processing is re-entered.

With action submit-state(⟨r, v⟩, φ_state)_p process p submits its current state to the DLC service.
The relevant information submitted to the service consists of the last round r for which process
p has accepted a value (i.e., has sent an "accept" message) and the value v of that round. The
condenser function φ_state collects all the states submitted by processes in the current configuration
and computes the value to propose in the current round. Formally it is a function that takes a set
of pairs W = {(r, v) | r ∈ G, v ∈ X} and returns a pair (r', v') ∈ W where r' is such that r' ≥ r for
all (r, v) ∈ W. At this point process p has to wait for the DLC service to deliver the starting state.
Hence it sets status_p to exch-wait.
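
A minimal Python sketch of such a condenser function is given below. It is only an illustration of the definition: the name phi_state is ours, round numbers are modeled as integers, and a process that has not yet accepted any value is assumed to submit None in place of ⊥, which the sketch treats as smaller than every round number.

```python
from typing import Any, Iterable, Optional, Tuple

State = Tuple[Optional[int], Any]  # a submitted pair (r, v); r is None for "bottom"


def phi_state(states: Iterable[State]) -> State:
    """Return a submitted pair (r', v') whose round r' is maximal among all pairs."""
    def key(pair: State):
        r, _ = pair
        # Pairs with r = None sort before every pair carrying a real round number.
        return (r is not None, r if r is not None else 0)
    return max(states, key=key)  # raises ValueError if no state was submitted
```

For example, from the submitted states (⊥, x), (2, b) and (1, a) the sketch returns (2, b), so b is the value proposed in the new configuration.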

Action newstate(⟨r, v⟩)_p delivers to process p the starting state for p's current configuration. This
starting state contains the value v to propose in the round for the current configuration. Normal
processing is re-entered by setting status_p = normal.

Next we describe the transitions for normal processing, where rounds are run. 

Action newround_p starts a new round; in order to do this, the leader of the current configuration,
say process p, must be ready to start a new round, that is, mode_p must be equal to newround. The
effect is just to change mode_p to begincast. Process p is now ready to send a begin message for
the current round.

Action submit(⟨"begin", v⟩, φ_begin, i)_p takes care of sending the begin message through the DLC service.
The condenser function φ_begin is a function that takes a set of acknowledgment values {⟨"accept", i⟩_q | q ∈ Q}
for some set of processes Q ⊆ P and returns ⟨"begin", Q⟩. After the execution of this action
process p has mode_p = wait because it needs to wait for acknowledgment values.
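
The condenser function for begin messages simply reports which processes have accepted. A possible Python rendering, given only as an illustration (the name phi_begin and the representation of the acknowledgment values as a dictionary from process names to ("accept", i) pairs are assumptions of the sketch), is:

```python
from typing import Dict, FrozenSet, Tuple


def phi_begin(acks: Dict[str, Tuple[str, int]]) -> Tuple[str, FrozenSet[str]]:
    """Map the values ("accept", i) received from the processes in Q to ("begin", Q)."""
    if not all(value[0] == "accept" for value in acks.values()):
        raise ValueError("phi_begin expects only accept acknowledgments")
    return ("begin", frozenset(acks))
```

The returned set Q is the one that the leader later records as the accepting-quorum of the round.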

Action deliver(⟨"begin", v⟩, i)_p is an input action from the DLC service, which delivers the begin
message from another process. The effect of this action is to put the acknowledgment value ⟨"accept", i⟩
into the queue ack-q, from which it will be sent back to the DLC service by action ackdlvr(a, i)_p.

Action respond(⟨"begin", Q⟩, i)_p is an input action from the DLC service which provides the response
to the "begin" message previously sent with action submit(⟨"begin", v⟩, φ_begin, i)_p. This action tells the
leader that processes in the quorum Q have accepted the current round. At this point the leader
can make a decision and thus sets the decision_p variable to the value rnd-value_p proposed in the
current round.

The remaining actions are used to spread a decision to all members of the current configuration
once the leader has reached a decision. Action p2p-send(v)_{p,q} is used by the leader p to send the
decision v = decision_p to a process q that does not yet know the decision. Action p2p-recv(v)_{p,q} is
executed by process q when it receives a decision v from the leader p; process q sets its decision_q
variable and sets its mode_q to decided. Then, process q sends an acknowledgment back to the
leader with action p2p-send("ack")_{q,p} and this acknowledgment is received by the leader with action
p2p-recv("ack")_{q,p}.
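
The transitions described so far can be mirrored by a small Python class, shown here as an informal sketch rather than as an implementation of Figure 7-2. The class and method names are ours, preconditions are rendered as assertions, the DLC service is not modeled (its inputs are simply method calls), and the bookkeeping of operation identifiers, history variables and decision dissemination is omitted.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class Conf:
    id: int    # configuration identifier (an integer in this sketch)
    ldr: str   # leader of the configuration


@dataclass
class DlcToPaxos:
    p: str
    rnd_value: Any
    current: Optional[Conf] = None
    status: str = "normal"
    mode: str = "wait"
    decision: Any = None
    last_rnd: Optional[int] = None
    last_val: Any = None
    ack_q: List[Tuple[str, int]] = field(default_factory=list)

    def newconf(self, c: Conf) -> None:
        self.current, self.status, self.ack_q = c, "exch-ready", []
        if self.p == c.ldr and self.mode != "decided":
            self.mode = "newround"

    def submit_state(self) -> Tuple[Optional[int], Any]:
        assert self.status == "exch-ready"
        self.status = "exch-wait"
        return (self.last_rnd, self.last_val)

    def newstate(self, r: Optional[int], v: Any) -> None:
        self.status = "normal"
        self.last_rnd, self.last_val, self.rnd_value = self.current.id, v, v

    def newround(self) -> None:
        assert self.mode == "newround" and self.status == "normal"
        self.mode = "begincast"

    def submit_begin(self) -> Tuple[str, Any]:
        assert self.mode == "begincast" and self.status == "normal"
        self.mode = "wait"
        return ("begin", self.rnd_value)

    def deliver_begin(self, v: Any, i: int) -> None:
        self.ack_q.append(("accept", i))
        self.last_rnd, self.last_val = self.current.id, v

    def respond_begin(self, quorum) -> None:
        self.decision, self.mode = self.rnd_value, "decided"
```

A run of one round then consists of newconf and submit_state at every member, newstate with the pair chosen by the state condenser, newround and submit_begin at the leader, deliver_begin at the members, and finally respond_begin at the leader.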

We augment the code with the following history variables:

• Hvalue(r)_p ∈ X ∪ ⊥, initially v_p for r = g_0 and p = c_0.ldr and ⊥ elsewhere. This variable
records the value for round/configuration r.

• Hfrom(r)_p ∈ R ∪ ⊥, initially ⊥ for all r, p. This variable records the round/configuration from
which Hvalue(r) is taken.

• Haccquo(r)_p, a subset of P or ⊥, initially ⊥ for all r, p. This variable records the
accepting-quorum of round r.

We conclude with a few remarks. Submitting the state for a new configuration corresponds
to sending a "Last" message in the original PAXOS algorithm. Notice that there is no need to
commit to reject older rounds, because this is automatically guaranteed by the configuration-oriented
communication of the DLC service. Submitting the begin message to the DLC service corresponds to
broadcasting a "Begin" message in the original PAXOS algorithm. Sending the acknowledgment value
⟨"accept", i⟩ for a "begin" operation corresponds to sending an "Accept" message in the original
PAXOS algorithm.

Finally we remark that the DLC service encapsulates in the state-exchange mechanism part of
the PAXOS algorithm. This is done by means of the condenser function φ_state, which computes the
value of the latest configuration for which processes that are members of the new configuration have accepted
a value. This computation is a key point in the original PAXOS algorithm.

7.2.4 Proof of correctness for DPAXOS 

In this section, we prove the correctness of DPAXOS. We recall that since at most one round is run in
a configuration, we use configuration identifiers as round numbers (round numbers are elements of
the set G). Also, we say that a configuration c is successful in a state s when s.Haccquo(c.id) ≠ ⊥;
informally this means that the round conducted in the configuration is successful, and a decision is
made by the leader.

Next we provide invariants needed to prove agreement. We start with some basic invariants.

Invariant 7.2.1 (DPAXOS)
In any reachable state the following is true. Let c be a configuration established at a process p and
such that c.id > g_0. Then Hvalue(c.id)_p ≠ ⊥.

Proof: Since configuration c is established at p and c.id > g_0, we have that action newstate(⟨r, v⟩)_p
for some r ∈ G and v ∈ X, with v ≠ ⊥, has been executed (this is not true for c = c_0). By the code
of this action we have that Hvalue(c.id)_p = v. □

Invariant 7.2.2 (DPAXOS)
In any reachable state the following is true. Let c be a successful configuration and let p = c.ldr.
Then Haccquo(c.id)_p ≠ ⊥ and c is established at Haccquo(c.id)_p and at p.

Proof: In order for a configuration c to be successful, the leader p = c.ldr must propose a value.
Clearly it must be that c is established at p. In order for a configuration c to be successful, the leader
p must execute action respond(⟨"begin", Q⟩, i)_p which sets Haccquo(c.id)_p to Q. Clearly any process of
Q must have established c. □

Invariant 7.2.3 (DPAXOS)
In any reachable state the following is true. Let c be an established configuration such that g_0 < c.id.
Then for any two processes p, q that have established c, we have Hvalue(c.id)_p = Hvalue(c.id)_q.

Proof: Process q sets Hvalue(c.id)_q to v when newstate(⟨r, v⟩)_q for configuration c is executed.
Process p sets Hvalue(c.id)_p to v' when newstate(⟨r', v'⟩)_p for configuration c is executed. By the
code of the DLC service, every process gets the same state for configuration c, that is, r' = r and
v' = v. □

The following invariant states that when a configuration c is successful, any other configuration 
up to (and including) the next totally established configuration is for the same value as c. 

Invariant 7.2.4 (DPAXOS)
In any reachable state the following is true. Let c be a successful configuration. Then for any
configuration c' established at a process q, with c.id < c'.id and such that there are no totally
established configurations in between c and c', we have that Hvalue(c'.id)_q = Hvalue(c.id)_{c.ldr} ≠ ⊥.

Proof: By induction on the length of the execution. The base case consists of proving that the 
invariant is true in the initial state. In the initial state the invariant is vacuously true because there 
is no successful configuration. 

For the inductive step assume that the invariant is true in a state s. We need to prove that the
invariant is true in s' for any step (s, π, s'). Consider s' and fix c and c' as required by the statement
in s'. Let p = c.ldr. By Invariant 7.2.2, in state s configuration c is established at p and at the
accepting-quorum Haccquo(c.id)_p = Q ∈ c.qrms. So we have that Q ⊆ s.state-dlv[c.id].
Now we distinguish two cases: (i) configuration c' is established at q in state s, (ii) configuration c'
is not established at q in state s. In the former case the invariant follows by the inductive hypothesis.
We need to consider the latter case. Since c' is not established at q in s and is established at q in s',
it must be the case that π = newstate(⟨r, v⟩)_q for some r ∈ G and v ∈ X. Configurations c and c' are
not dead in s', as well as in s, because they are established in s'; moreover, by assumption, we have
that in s' there is no totally established configuration between c and c'. Hence, by Invariant 7.1.1,
there exists a quorum Q' ∈ c.qrms such that Q' ⊆ c'.set. By the properties of quorums, there
exists a process q' ∈ Q ∩ Q'. For such a process we have that q' ∈ c'.set and q' ∈ s.state-dlv[c.id].
Let s'' be the state in which process q' executes action submit-state(⟨r', v'⟩) that submits the state
of q' to the condenser function φ_state for c'; since q' ∈ s.state-dlv[c.id] and in state s'' we have
that current_{q'} = c'.id, it must be the case that q' ∈ s''.state-dlv[c.id] (because q' will not execute
any other action for configuration c once its current configuration is c'). Hence the pair (r', v')
that q' submits to the condenser function φ_state for c' is such that r' ≥ c.id. By the definition of
φ_state we have that the pair (r, v) returned by action π is such that r ≥ c.id. Hence we have that
s'.Hfrom(c'.id)_q ≥ c.id.

Let q'' be the process from which the φ_state function takes the pair (r, v) returned with action π; thus
v = s'.Hvalue(r)_{q''}. By the code we have that s'.Hfrom(c'.id)_q = r and that s'.Hvalue(c'.id)_q = v.
Hence s'.Hvalue(c'.id)_q = s'.Hvalue(r)_{q''}.

If r = c.id then we consider two cases.
Case (i): r = g_0. Then we claim that q'' = p. Indeed if the configuration with the biggest identifier
among those submitted to the condenser function for c' is c_0, this means all members of c', when
they submit their state to the condenser function, have not established any other configurations with
identifier greater than g_0. This implies that all processes except p submit (⊥, ·) to the condenser
function φ_state and p submits (g_0, v_p). Hence p is selected by the condenser function φ_state. Thus
we have s'.Hvalue(c'.id)_q = s'.Hvalue(c.id)_p, as needed.

Case (ii): r > g_0. Obviously we have that s'.Hvalue(c'.id)_q = s'.Hvalue(c.id)_{q''}. By Invariant 7.2.3
we have that s'.Hvalue(c.id)_{q''} = s'.Hvalue(c.id)_p, and the invariant holds in this case.

It remains to consider the case r > c.id. By the inductive hypothesis applied to c, r and q'' we have
that s'.Hvalue(r)_{q''} = s'.Hvalue(c.id)_p. Hence we conclude that s'.Hvalue(c'.id)_q = s'.Hvalue(c.id)_p,
as needed. □

The following invariant is similar to the previous one, but considers totally established
configurations instead of successful ones. It states that when a configuration c is totally established, any
other configuration up to (and including) the next totally established configuration is such that its
Hvalue is the same as that of c. First we give an auxiliary invariant.

Invariant 7.2.5 (DPAXOS)
In any reachable state the following is true. Let c be a totally established configuration such that
g_0 < c.id. Then for any configuration c' established at a process q, with c.id < c'.id and such that
there are no totally established configurations in between c and c', we have that Hvalue(c'.id)_q =
Hvalue(c.id)_{c.ldr}.

Proof: By induction on the length of the execution. The base case consists of proving that the
invariant is true in the initial state. In the initial state the invariant is vacuously true because there
is no established configuration $c$ such that $g_0 < c.id$.

For the inductive step assume that the invariant is true in a state $s$. We need to prove that the
invariant is true in $s'$ for any step $(s, \pi, s')$. Consider state $s'$ and let $c$ and $c'$ be as required by the
statement in state $s'$. Let $p = c.ldr$. We distinguish four possible cases.

CASE 1: $c \in s.TotEst$ and $c'$ is established at $q$ in $s$. Then we can apply the inductive hypothesis.

CASE 2: $c \in s.TotEst$ and $c'$ is not established at $q$ in $s$. Then it must be the case that
$\pi = \text{NEWSTATE}(\langle r, v \rangle)_q$. This action sets $Hvalue(c'.id)_q = v$.

Since $c'$ is established in $s'$ it is also not dead. Clearly also $c$ is not dead in $s'$. By Invariant 7.1.1
we have that there is a quorum $Q$ of $c$ such that $Q \subseteq c'.set$. Let $q' \in Q$. For such a process we have
that $q' \in s'.\textit{state-dlv}[c.id]$ and $q' \in c'.set$.

Let $s''$ be the state in which process $q'$ executes action $\text{SUBMIT-STATE}(\langle r', v' \rangle)$ that submits the state
of $q'$ to the condenser function $\phi_{state}$ for $c'$; since $q' \in s.\textit{state-dlv}[c.id]$ and in state $s''$ we have that
$current_{q'} = c'.id$ it must be the case that $q' \in s''.\textit{state-dlv}[c.id]$ (because $q'$ will not execute any other
action for configuration $c$ once its current configuration is $c'$). Hence the pair $\langle r', v' \rangle$ that $q'$ submits
to the condenser function $\phi_{state}$ for $c'$ is such that $r' \geq c.id$. By the definition of $\phi_{state}$ we have that
the pair $(r, v)$ returned by action $\pi$ is such that $r \geq c.id$. Hence we have that $s'.Hfrom(c'.id)_q \geq c.id$.
Let $q''$ be the process from which the $\phi_{state}$ function takes the pair $(r, v)$ returned with action $\pi$; thus
$v = s'.Hvalue(r)_{q''}$. By the code we have that $s'.Hfrom(c'.id)_q = r$ and that $s'.Hvalue(c'.id)_q = v$.
Hence $s'.Hvalue(c'.id)_q = s'.Hvalue(r)_{q''}$.

If $r = c.id$ then we have that $s'.Hvalue(c'.id)_q = s'.Hvalue(c.id)_{q''}$. By Invariant 7.2.3 we have
that $s'.Hvalue(c.id)_{q''} = s'.Hvalue(c.id)_p$ and thus the invariant holds. So consider the case $r > c.id$.
By the inductive hypothesis applied to $c$, $r$ and $q''$ we have that $s'.Hvalue(r)_{q''} = s'.Hvalue(c.id)_p$.
Hence we conclude that $s'.Hvalue(c'.id)_q = s'.Hvalue(c.id)_p$.

CASE 3: $c \notin s.TotEst$ and $c'$ is established at $q$ in $s$. Then it must be the case that $\pi = \text{NEWSTATE}(\langle r, v \rangle)_{p'}$,
for some process $p'$ that totally establishes configuration $c$. Configurations $c$ and $c'$ are not dead in
$s'$. By Invariant 7.1.1 we have that there is a quorum $Q$ of $c$ such that $Q \subseteq c'.set$. Let $q' \in Q$. For
such a process we have that $q' \in s'.\textit{state-dlv}[c.id]$ and $q' \in c'.set$.

The proof proceeds as in the previous case: let $s''$ be the state in which process $q'$ executes action
$\text{SUBMIT-STATE}(\langle r', v' \rangle)$, which submits the state of $q'$ to the condenser function $\phi_{state}$ for $c'$, and so on (as done
in the previous case).

CASE 4: $c \notin s.TotEst$ and $c'$ is not established at $q$ in $s$. This is not possible because a single action
cannot make both $c$ totally established and $c'$ established. □

The following invariant follows easily from the previous one. 

Invariant 7.2.6 (DPAXOS)

In any reachable state the following is true. Let $c$ be a totally established configuration such that
$g_0 < c.id$. Then for any configuration $c'$ established at $c'.ldr$, with $c.id < c'.id$, we have that
$Hvalue(c'.id)_{c'.ldr} = Hvalue(c.id)_{c.ldr}$.

Proof: Let $c$ and $c'$ be as required by the statement. Let $c_1, c_2, \ldots, c_k$ be the sequence, in order
of configuration identifiers, of the totally established configurations properly between $c$ and $c'$. By
Invariant 7.2.5 we have that $Hvalue(c_1.id)_{c_1.ldr} = Hvalue(c.id)_{c.ldr}$; by the same invariant we have
$Hvalue(c_2.id)_{c_2.ldr} = Hvalue(c_1.id)_{c_1.ldr}$ and so on up to $Hvalue(c'.id)_{c'.ldr} = Hvalue(c_k.id)_{c_k.ldr}$.
Thus we have that $Hvalue(c'.id)_{c'.ldr} = Hvalue(c.id)_{c.ldr}$. □

The following invariant is crucial to proving agreement. 

Invariant 7.2.7 (DPAXOS)

In any reachable state the following is true. Let $c$ be a successful configuration. Then for any
configuration $c'$ established at $c'.ldr$, with $c.id < c'.id$, we have that $Hvalue(c'.id)_{c'.ldr} = Hvalue(c.id)_{c.ldr}$.

Proof: If there are no totally established configurations between $c$ and $c'$ then the invariant follows
directly from Invariant 7.2.4. So assume that there exists at least one totally established configuration with
identifier strictly greater than $c.id$ and strictly smaller than $c'.id$. Let $c^*$ be the totally established
configuration having the smallest identifier strictly greater than $c.id$ and let $q$ be $c^*.ldr$. Clearly we
have that $c^*$ is established at $q$. By definition of $c^*$ there are no totally established configurations
between $c$ and $c^*$. By Invariant 7.2.4 we have that $Hvalue(c^*.id)_q = Hvalue(c.id)_{c.ldr}$.
Since $c.id \geq g_0$, we have $c^*.id > g_0$. Hence by Invariant 7.2.6 we have that $Hvalue(c'.id)_{c'.ldr} =
Hvalue(c^*.id)_q$. Hence $Hvalue(c'.id)_{c'.ldr} = Hvalue(c.id)_{c.ldr}$, as needed. □

We are now ready to prove agreement. 

Theorem 7.2.8 In any execution of the system DPAXOS, agreement is satisfied. 

Proof: In order to prove agreement we need to show that all the decision variables are set to
the same value. By the code it is immediate that decision variables are always set to be equal
to $Hvalue(c.id)_{c.ldr}$ for some successful configuration $c$. Hence it is enough to prove that, in any
reachable state $s$, any two successful configurations $c$ and $c'$ are such that
$s.Hvalue(c.id)_{c.ldr} = s.Hvalue(c'.id)_{c'.ldr}$.

Let $p = c.ldr$ and $p' = c'.ldr$ and without loss of generality assume that $c.id < c'.id$. By
Invariant 7.2.7 we have that $s.Hvalue(c.id)_p = s.Hvalue(c'.id)_{p'}$. □

Validity is easier to prove since the value proposed in any round comes either from an initial 
value or from a previous round. 

Invariant 7.2.9 (DPAXOS)

In any reachable state of an execution $\alpha$, for any round $r$ such that $Hvalue(r)_{r.ldr} \neq \perp$, we have that
$Hvalue(r)_{r.ldr} \in X_\alpha$.

Proof: By induction on the length of the execution $\alpha$. The base case consists of proving that the
invariant is true in the initial state. In the initial state $Hvalue(r)_p$ is not $\perp$ only for $r = g_0$ and
$p = r.ldr$. Moreover $Hvalue(g_0)_p$ is equal to the initial value of $p$. Hence the assertion is true.

For the inductive step assume that the invariant is true in a reachable state $s$. We need to prove
that the invariant is still true in $s'$ for any possible step $(s, \pi, s')$.

Clearly the only actions that can make the assertion false are those that set $Hvalue(r)_p$ for some
round $r$ and $p = r.ldr$. The only action that sets $Hvalue(r)_p$ is action $\pi = \text{NEWSTATE}(\langle r', v \rangle)_p$ for round
$r$. Action $\pi$ sets $Hvalue(r)_p$ to $v$. We need to prove that $v \in X_\alpha$. This follows from the definition of
$\phi_{state}$ and the fact that the values submitted to $\phi_{state}$ are the $\textit{last-val}_q$ variables which, in turn, are
either the initial value $v_q$ of $q$ or the value $Hvalue(r')_{r'.ldr}$ of a previous round $r'$; by the inductive
hypothesis we have that $Hvalue(r')_{r'.ldr}$ belongs to $X_\alpha$. □

Theorem 7.2.10 In any execution of the system DPAXOS, validity is satisfied. 

Proof: Let $\alpha$ be an execution of DPAXOS. A variable $decision$ is always set to be equal to some
$Hvalue(r)_p \neq \perp$ for some $r$ and $p = r.ldr$ or to some other $decision$ variable. By Invariant 7.2.9 we
have that $Hvalue(r)_p$ belongs to $X_\alpha$. Hence validity is satisfied. □

Finally we claim, informally, that termination is satisfied. We remark that we are making the
assumption that any failure in the system is detected by the group communication service, which
changes the configuration in order to reflect the new status of the underlying distributed system.

Consider an execution of the system such that there exists a state $s$ in which a configuration $c$
becomes totally established. Let $t$ be the point in time at which the system enters state $s$. Assume that
there are no failures after time $t$. Then nothing can prevent the round run in configuration
$c$ from succeeding. Thus the leader of configuration $c$ eventually writes its own decision variable. Once
having done that, the leader keeps sending (see code) the value of its decision variable to every other
process member of the configuration until it receives an acknowledgment. Since there are no failures,
every member of the configuration $c$ will eventually receive the message from the leader and write
the decision variable.

7.2.5 Remarks

We remark that the point-to-point communication mechanism of DLC is used by DLC-TO-PAXOS just
to spread a reached decision to all members of the current configuration. Although it is fine to use
configuration synchronous point-to-point messages, there is no need to require that messages used to
spread the decision be configuration synchronous. A regular point-to-point channel, which delivers
messages regardless of the configurations of the sender and the receiver, works fine too.
Hence for the DPAXOS algorithm we could use a weaker version of the DLC service which provides
point-to-point messages without configuration synchrony. We have used the stronger version because
the algorithm that we present in the next section needs configuration synchronous point-to-point
messages.

The original PAXOS algorithm [61] is designed to work with majorities or with more general
quorums of a static universe of processes. Using quorums is good for handling transient failures of a
system. However, it does not work well for permanent failures. The advantage of building PAXOS over
the DLC group communication service is that it can also adapt to permanent failures by changing
the configuration of the system.

It would be useful to compare the performance of the PAXOS algorithm built on top of DLC with
that of the original PAXOS algorithm. Since our work has not addressed performance issues we leave
this as future work.

The same technique that we have used to build PAXOS on top of DLC could be used to build
MULTIPAXOS on top of DLC. The MULTIPAXOS algorithm [61]¹ is basically a sequence of instances
of the PAXOS algorithm that run together and optimize the number of messages needed in the first
part of the round. The optimization is achieved by sending a single message that works for all the
instances of PAXOS. By using the DLC service such an optimization would be obtained by running
multiple instances of PAXOS in the same configuration; the state exchange needs to be done only once
for all the instances.

7.3 A Replicated Atomic Object Algorithm 

In this section we develop a data replication algorithm that implements a replicated atomic object
with arbitrary operations (not necessarily just read and write, though in practice these are the most
common types of operations used). The algorithm, called RAB (Replicated Atomic oBject), is built
upon the DLC service and uses a primary site to handle access to the object.

We start with an informal description of the algorithm, then we provide the formal code and
finally we provide key arguments for its correctness. Providing a formal proof of correctness is left
as future work.

¹ The name MULTIPAXOS is actually used in [29]. The original paper by Lamport [61] uses a different name
(multi-decree parliament protocol).

7.3.1 Description of RAB 

Operations are centralized at the leader of the current configuration; the leader requires the collab- 
oration of at least a quorum of processes in order to handle requests. 

Clients of the service request to perform operations on the data. Each process accepts requests
from its client and places them in a local order. Then each of the received requests is sent to
the leader, who is responsible for building a global order for all the clients' requests. For each of
these requests the leader makes sure that at least a quorum of processes knows the request before
providing an answer to the process that originated the request. Once such an answer is provided to
the originator process, a response can be given back to the client.

When a configuration change happens all the members submit their knowledge about the requests 
performed so far and a new common state is computed from the local knowledge of the processes. 
In particular, each process submits its own information about the global order of operations plus 
all the local requests that are still pending, that is, have been submitted to the leader but have not 
received a response. The global orders submitted by each member of the new configuration are used 
to compute the most up to date global order, while the information about pending requests is used 
to locate those operations that must be resubmitted to the leader. 

7.3.2 The code of DLC-TO-RAB

In this section we provide the code of algorithm DLC-TO-RAB. We first define some data types. We
denote by $X$ the set of values that the shared data can assume and by $v_0 \in X$ a predefined value.
The set $T$ is a set of types of operations (e.g., read, write operations).

The set $\mathcal{D}$ of "operation descriptors" is defined as $\mathcal{D} = \{\langle p, t, w, i \rangle \mid p \in \mathcal{P}, t \in T, w \in X, i \in \mathbb{N}^{>0}\}$.
Operation descriptors are used to describe both the requests from the clients and the corresponding
responses. For an element $y = \langle p, t, w, i \rangle$ of $\mathcal{D}$ we use the following selectors to extract the single
components: $y.origin = p$, $y.type = t$, $y.param = w$ and $y.\textit{local-rank} = i$. Component $y.origin$
records the client at which the request has originated. Component $y.type$ specifies the type of
operation. For example, if we want a read-write register, types could be $T = \{\texttt{"read"}, \texttt{"write"}\}$.
Component $y.param$ provides possible parameters that need to be passed along with the request
or with the corresponding response. Considering again the case of a read-write register, a write
needs to pass the value to be written and the response to a read needs to pass the value read. For
simplicity we assume that only one value needs to be passed and this value is an element of $X$ (this
is so in the case of a read-write register).
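To make the data type concrete, the following is a minimal Python sketch of an operation descriptor for the read-write register case. The class name `Descriptor`, the use of strings for process names, and the sample values are illustrative assumptions and are not part of the DLC-TO-RAB code.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Descriptor:
    """Operation descriptor <p, t, w, i> (hypothetical Python rendering)."""
    origin: str      # p: process at which the request originated
    type: str        # t: operation type, here "read" or "write"
    param: Any       # w: value passed with the request or with the response
    local_rank: int  # i: position in the originator's local request list (>= 1)


# The third request issued at process "p2": write the value 42.
write_req = Descriptor(origin="p2", type="write", param=42, local_rank=3)

# The response to the first request of "p1", a read that returned 42.
read_resp = Descriptor(origin="p1", type="read", param=42, local_rank=1)

# Point-to-point messages are descriptors tagged with "req" or "ans".
req_msg = ("req", write_req)
ans_msg = ("ans", read_resp)
```

The same record is used for requests and responses; only the meaning of `param` changes (value to write versus value read).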

The set of messages that can be sent over the point-to-point channels and through the DLC service
is defined as $\mathcal{M} = \mathcal{D} \cup (\{\texttt{"req"}, \texttt{"ans"}\} \times \mathcal{D})$. The set of operation identifiers is $OID = \mathbb{N}^{>0} \times \mathcal{P}$.
The set $\mathcal{S}$ is defined as $\mathcal{S} = \{\langle o, d, a, h \rangle \mid o \in arrayof(\mathcal{D}), d \in arrayof(bool), a \in arrayof(bool), h \in \mathcal{G}\}$.
We remind the reader that the notation $arrayof(\mathcal{D})$ indicates an array whose elements are either
elements of $\mathcal{D}$ or $\perp$. The set of acknowledgment values is $\mathcal{A} = \{\texttt{"ack"}\} \cup \mathcal{S}$. The set of response
values is $\mathcal{R} = \{\texttt{"done"}\} \cup \{\langle o, d, a \rangle \mid o \in arrayof(\mathcal{D}), d \in arrayof(bool), a \in arrayof(bool)\}$.

The DLC-TO-RAB algorithm uses two condenser functions that we define in the following:

• $\phi_{done}$: This condenser function takes a set of acknowledgment values "ack" and returns the
string "done". This function is used by the leader to make sure that a quorum of processes
have received information about a particular operation.

• $\phi_{rabstate}$: Let $c$ be the configuration for which this condenser function is to be used. The
condenser function takes a collection $S \subseteq \mathcal{S}$ of tuples, one for each member of $c$, and returns
a triple $\langle o, d, a \rangle \in \mathcal{R}$, defined as follows:

- $o$ is defined as follows: Let $R$ be the set of tuples of $S$ that have the maximum $high$
component. For any $i \in \mathbb{N}^{>0}$ such that there exists at least one element $x \in R$ with
$x.order(i) \neq \perp$, fix any such element $x$ and set $o(i) := x.order(i)$. For any $i$ for which no
such element exists set $o(i) := \perp$.

- $d$ is defined as the "or" of the $\textit{req-done}$ components in $R$.

- $a$ is defined as the "or" of the $\textit{req-answrd}$ components in $R$.
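The two condenser functions can be sketched in Python as follows. This is only an illustration under stated assumptions: arrays are modelled as dictionaries from positions to values (a missing key stands for ⊥), configuration identifiers are plain integers, and the names `SubmittedState`, `phi_done` and `phi_rabstate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class SubmittedState:
    """One tuple <o, d, a, h> submitted by a member of the new configuration."""
    order: Dict[int, str] = field(default_factory=dict)        # o; missing key = bottom
    req_done: Dict[int, bool] = field(default_factory=dict)    # d
    req_answrd: Dict[int, bool] = field(default_factory=dict)  # a
    high: int = 0                                              # h, assumed to be an integer id


def phi_done(acks: Iterable[str]) -> str:
    """Condense a collection of "ack" values into the response "done"."""
    assert all(a == "ack" for a in acks)
    return "done"


def phi_rabstate(states: List[SubmittedState]) -> Tuple[Dict[int, str], Dict[int, bool], Dict[int, bool]]:
    """Compute the triple <o, d, a> from the states with maximum high component."""
    top = max(s.high for s in states)
    R = [s for s in states if s.high == top]
    order: Dict[int, str] = {}
    for s in R:
        for i, m in s.order.items():
            order.setdefault(i, m)  # fix any element x of R with x.order(i) defined
    positions = set()
    for s in R:
        positions |= set(s.req_done) | set(s.req_answrd)
    done = {i: any(s.req_done.get(i, False) for s in R) for i in positions}
    answrd = {i: any(s.req_answrd.get(i, False) for s in R) for i in positions}
    return order, done, answrd


# One member knows "op1" (not yet done); six others submit empty states.
states = [SubmittedState(order={1: "op1"}, req_done={1: False},
                         req_answrd={1: False}, high=1)]
states += [SubmittedState(high=1) for _ in range(6)]
order, done, answrd = phi_rabstate(states)
```

In this run the condensed order contains `op1` in position 1 while its done flag stays false, which is the situation the leader of a new configuration has to handle by resubmitting the operation.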

The code of DLC-TO-RAB is provided in Figures 7-3 and 7-4; we describe it next. We start with the
description of the state variables.

Variable $current_p$ contains the current configuration of process $p$ and variable $high_p$ contains the
latest established configuration of process $p$. Variable $\textit{local-req}_p$ is the sequence of requests that the
client submits at process $p$; variable $\textit{local-ans}_p$ contains the answers for all of the requests. Variable
$next_p$ is a pointer used to insert new requests from the client into $\textit{local-req}_p$. Variable $order_p$ contains
the sequence of all requests as known by process $p$. Variable $status_p$ contains the status of process
$p$; it is used when a new configuration is announced; for regular computation this variable is set to
normal.

The remaining state variables are flags used to record that some particular actions have happened.
Variable $\textit{req-sent}(j)_p$ is set to true when the $j$-th request of the client at $p$, that is $\textit{local-req}(j)_p$,
is sent to the leader of the current configuration. Such a request, when received by the leader,
is placed in the global sequence of requests $order$ in some available position, say $i$. Variable
$\textit{req-sbmttd}(i)_p$ is set to true when the leader $p$ has submitted the request via the DLC service
with action $\text{SUBMIT}(m, \phi_{done}, (i,p))_p$. Variable $\textit{req-acked}(i)_p$ is set to true when process $p$ has sent an
acknowledgment for the $i$-th request. Variable $\textit{req-done}(i)_p$ is set to true when the leader $p$ receives
the response from the DLC service for the $i$-th request with action $\text{RESPOND}(\texttt{"done"}, (i,p))_p$. Variable
$\textit{req-answrd}(i)_p$ is set to true when the leader $p$ has sent an answer for the $i$-th request to the
originator process of that request. Variable $\textit{req-cnfrmd}(j)_p$ is set to true when process $p$ has given
the client a response for the $j$-th request submitted at $p$.

Signature:

  Input:   REQUEST(type, param)_p, type ∈ T, param ∈ X, p ∈ P
           P2P-RECV(m)_p, m ∈ M, p ∈ P
           DELIVER(m, i)_p, m ∈ M, i ∈ OID, p ∈ P
           RESPOND(a, i)_p, a ∈ A, i ∈ OID_p, p ∈ P
           NEWCONF(c)_p, c ∈ C, p ∈ c.set
           NEWSTATE(s)_p, s ∈ S, p ∈ P

  Output:  CONFIRM(param)_p, param ∈ X, p ∈ P
           P2P-SEND(m)_p, m ∈ M, p ∈ P
           SUBMIT(m, φ, i)_p, m ∈ M, φ ∈ Φ, p ∈ P, i ∈ OID_p
           ACKDLVR(a, i)_p, a ∈ A, i ∈ OID, p ∈ P
           SUBMIT-STATE(s, φ)_p, s ∈ S, φ ∈ Φ, p ∈ P

State:

  current ∈ C_⊥, initially c_0 if p ∈ P_0, else ⊥
  high ∈ C_⊥, initially ⊥
  local-req ∈ arrayof(D), initially ⊥ everywhere
  local-ans ∈ arrayof(D), initially ⊥ everywhere
  next ∈ N, initially 1
  order ∈ arrayof(D), initially ⊥ everywhere
  status ∈ {normal, exch-ready, exch-wait}, initially normal
  req-sent ∈ seqof(bool), initially false everywhere
  req-sbmttd ∈ seqof(bool), initially false everywhere
  req-acked ∈ seqof(bool), initially false everywhere
  req-done ∈ seqof(bool), initially false everywhere
  req-answrd ∈ seqof(bool), initially false everywhere
  req-cnfrmd ∈ seqof(bool), initially false everywhere

Derived variable:

  apply-all(i)_p is defined as follows:
    if for all k ≤ i, order(k)_p ≠ ⊥ then
      apply-all(i)_p = ⟨q, t, w, j⟩, where q = order(i)_p.origin, t = order(i)_p.type,
      w is the value obtained by applying operations order(1, ..., i)_p to the initial value v_0 in order,
      and j = order(i)_p.local-rank;
    else
      apply-all(i)_p = ⊥.

Actions:

  input REQUEST(type, param)_p
    Eff: local-req(next) := ⟨p, type, param, next⟩
         next := next + 1

  output CONFIRM(param)_p
    Pre: local-ans(i) = ⟨p, type, param, i⟩
         ∀j < i, req-cnfrmd(j) = true
         req-cnfrmd(i) = false
    Eff: req-cnfrmd(i) := true

  output P2P-SEND(⟨"req", m⟩)_{p,q}     choose j
    Pre: current ≠ ⊥
         q = current.ldr
         m = local-req(j)
         m.local-rank = j
         req-sent(j) = false
         status = normal
    Eff: req-sent(j) := true

  input P2P-RECV(⟨"req", m⟩)_{q,p}
    Eff: Let i be such that
           ∀k < i, order(k) ≠ ⊥
           order(i) = ⊥
         order(i) := m

  output SUBMIT(m, φ_done, (i,p))_p
    Pre: current ≠ ⊥
         p = current.ldr
         order(i) = m
         req-sbmttd(i) = false
         status = normal
    Eff: req-sbmttd(i) := true

  input DELIVER(m, (i,r))_p
    Eff: order(i) := m
         req-acked(i) := false

  output ACKDLVR("ack", (i,r))_p
    Pre: r = current.ldr
         ∀j < i, order(j) ≠ ⊥
         req-acked(i) = false
         status = normal
    Eff: req-acked(i) := true

  input RESPOND("done", (i,p))_p
    Eff: req-done(i) := true

  output P2P-SEND(⟨"ans", a⟩)_{p,q}     choose i
    Pre: p = current.ldr
         order(i) = ⟨m, j⟩ for some m
         req-answrd(i) = false
         req-done(i) = true
         ∀k < j, order(k) ≠ ⊥
         ∀k < j, req-answrd(k) = true
         q = order(i).origin
         a = apply-all(i)
         status = normal
    Eff: req-answrd(i) := true

  input P2P-RECV(⟨"ans", a⟩)_{q,p}
    Eff: j := a.local-rank
         local-ans(j) := a

Figure 7-3: The DLC-TO-RAB code.

DLC-TO-RAB

  input NEWCONF(c)_p
    Eff: current := c
         status := exch-ready
         ∀i such that order(i) ≠ ⊥ and req-acked(i) = false
           req-acked(i) := true

  output SUBMIT-STATE(⟨o, d, a, g⟩, φ_rabstate)_p
    Pre: status = exch-ready
         o = order
         d = req-done
         a = req-answrd
         g = high.id
    Eff: status := exch-wait

  input NEWSTATE(⟨o, d, a⟩)_p
    Eff: status := normal
         high := current
         order := o
         req-done := d
         req-answrd := a
         ∀j such that local-req(j) ≠ ⊥
           if ∄i such that
              order(i).origin = p and
              order(i).local-rank = j then
             req-sent(j) := false
         ∀i such that order(i) ≠ ⊥
           if req-done(i) = false then
             req-sbmttd(i) := false
           else
             req-sbmttd(i) := true

Figure 7-4: The DLC-TO-RAB code (cont'd).

Next we describe the transitions. 

Action $\text{REQUEST}(type, param)_p$ records a new request from the client at process $p$ in the sequence of
local requests $\textit{local-req}_p$. Pointer $next_p$ always points to the first available location in the sequence
$\textit{local-req}_p$.

Action $\text{CONFIRM}(param)_p$ provides the response to the requests of the client. Such responses are
given in the same order as the requests are received. This is accomplished by using variable $\textit{req-cnfrmd}_p$; the
response to the $j$-th (local) request is given back to the client after all responses for the previous
requests have been provided, and, of course, when the response is available, that is, when $\textit{local-ans}(j)_p$
has been set.

Action $\text{P2P-SEND}(\langle \texttt{"req"}, m \rangle)_{p,q}$ is used by process $p$ to send the request $m$ to the leader. Such a
request is received by the leader with action $\text{P2P-RECV}(\langle \texttt{"req"}, m \rangle)_{q,p}$. The leader inserts request $m$ into
the global order of requests $order_p$ in the next available position.

Once the leader $p$ has placed a request in the $i$-th position of $order_p$, it executes action $\text{SUBMIT}(m, \phi_{done}, (i,p))_p$,
where $m = order(i)_p$. The leader needs to make sure that at least a quorum of processes learn about
the request.

A request $m$ submitted by process $q$ is delivered to process $p$ by the DLC service by means of action
$\text{DELIVER}(m, (i,r))_p$. Upon receiving such a request process $p$ simply updates its own order by placing
the request $m$ into $order(i)_p$. We remark that the code allows for overwriting a previous value;
however it is never the case that a process $p$ overwrites an old value of $order(i)$ with something
different received with action $\text{DELIVER}(m, (i,r))_p$. Flag $\textit{req-acked}(i)_p$ is set to false so that action
ACKDLVR will send an acknowledgment.

Action $\text{ACKDLVR}(\texttt{"ack"}, (i,r))_p$ sends back to the DLC service an acknowledgment for the $i$-th operation
and sets $\textit{req-acked}(i)_p$ to true.

Once a quorum of processes has sent acknowledgments for a particular request, the DLC service
notifies the leader with action $\text{RESPOND}(\texttt{"done"}, (i,p))_p$. The leader simply sets the flag $\textit{req-done}(i)$ to
true to record the fact that now request $i$ is known by a quorum of processes.

Once all the requests up to the $i$-th one are known to a quorum of the processes, the leader can
send an answer to the originator of the request. This is done in action $\text{P2P-SEND}(\langle \texttt{"ans"}, a \rangle)_{p,q}$. The
code of this action uses the derived variable $\textit{apply-all}(i)$, which applies all operations up to the $i$-th
and returns a tuple $a \in \mathcal{D}$ that contains the response for operation $i$ ($a.param$ is the value of the
shared data after the $i$-th operation). We remark that, of course, a real implementation will only
keep the current value; it would apply operations in order and provide an answer to operation $i$ right
after applying it and before applying operation $i + 1$.
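For a read-write register, the derived variable can be sketched in Python as below. This is an illustration only: descriptors are modelled as tuples `(origin, type, param, local_rank)`, the global order as a dictionary from positions to descriptors (a missing key stands for ⊥), and the function name `apply_all` is a hypothetical rendering of apply-all.

```python
from typing import Dict, Optional, Tuple

# (origin, type, param, local_rank); param is None for a read request
Descriptor = Tuple[str, str, Optional[int], int]


def apply_all(order: Dict[int, Descriptor], i: int, v0: int) -> Optional[Descriptor]:
    """Response descriptor for operation i, or None if some position <= i is undefined."""
    if any(order.get(k) is None for k in range(1, i + 1)):
        return None
    value = v0
    for k in range(1, i + 1):
        _, op_type, param, _ = order[k]
        if op_type == "write":
            value = param  # a write replaces the register content
        # a read leaves the register content unchanged
    origin, op_type, _, local_rank = order[i]
    # the param of the response is the value of the data after operation i
    return (origin, op_type, value, local_rank)


order = {
    1: ("p1", "write", 5, 1),
    2: ("p2", "read", None, 1),
    3: ("p1", "write", 7, 2),
}
answer_2 = apply_all(order, 2, v0=0)
answer_3 = apply_all(order, 3, v0=0)
```

The sketch replays the whole prefix on every call, exactly as the derived variable is defined; as remarked above, a real implementation would keep only the current value.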

When process $p$ receives the answer for a request previously sent to the leader it just records the
answer into $\textit{local-ans}_p$. This is done in action $\text{P2P-RECV}(\langle \texttt{"ans"}, a \rangle)_{q,p}$.

Finally we describe the actions used for the state exchange. When a new configuration is
announced with action $\text{NEWCONF}(c)_p$, process $p$ sets its current configuration to $c$ and goes into
reconfiguration mode by setting $status_p$ to exch-ready. It also sets $\textit{req-acked}_p$ to true for all those operations
that have pending acknowledgments to be sent; since the old configuration has been left, such an
acknowledgment must not be sent anymore and setting the flag $\textit{req-acked}_p$ to true has this effect.
Indeed in the new configuration process $p$ has not yet received any message from the leader so it is
incorrect to acknowledge a message.

Then process $p$ submits to the DLC service its $order_p$, $\textit{req-done}_p$, $\textit{req-answrd}_p$ and $high_p$, which
constitute the relevant part of the state that has to be exchanged.

When all the processes have submitted their states, the DLC service is able to compute the starting
state of the new configuration by using the $\phi_{rabstate}$ condenser function. Then it gives this state to
process $p$ by means of action $\text{NEWSTATE}(\langle o, d, a \rangle)_p$. When this action is executed, process $p$ updates
its $order_p$, $\textit{req-done}_p$ and $\textit{req-answrd}_p$ state components. It also adjusts the values of $\textit{req-sent}$ and
$\textit{req-sbmttd}$ to take care of two problems that arise in establishing a new configuration, as we
explain below.

The first problem is that any process $p$ has to check whether all of its local requests are in
the global order $o$ returned by $\text{NEWSTATE}(\langle o, d, a \rangle)_p$; for any local request not included in the order $o$,
process $p$ has to send that local request to the leader because the leader of the new configuration
does not know about such a request. This is done by setting $\textit{req-sent}(j)$ to false for those local
operations that the leader does not know about.

The other problem regards operations that are included in the order $o$ returned by $\text{NEWSTATE}(\langle o, d, a \rangle)_p$,
but for which $\textit{req-done}$ is still false. For such operations the leader cannot be sure that a quorum
of processes have them in their global order and thus cannot provide an answer for such operations.
The leader needs to resubmit such operations to the DLC service in order to make sure that a quorum
of processes learn about them, before giving an answer to the originator of the request.

A simple scenario that illustrates this problem is the following. Assume that quorums are just
majorities, the initial configuration has membership $\{p_1, p_2, p_3, p_4, p_5, p_6\}$, the leader is $p_6$ and the
identifier of the configuration is 1. Process $p_6$ receives a request $op_1$ from its client, puts it into
its local requests list and sends it to the leader (which in this case is itself). Then process $p_6$,
upon receiving its own request, puts $op_1$ into the first position of the global order. Process $p_6$
submits the request to the DLC service, but before any other process gets its message, a configuration
change happens. A new process $p_7$ joins the system and configuration 2, whose membership set is
$\{p_1, p_2, p_3, p_4, p_5, p_6, p_7\}$, is created. The leader of this configuration is $p_7$. Every member submits
its state. Only $p_6$ has something in its global order (namely $op_1$ in position 1) and thus the new
global order computed by $\phi_{rabstate}$ for configuration 2 just contains $op_1$ in position 1. Moreover we
have that $\textit{req-done}$ for this operation is false because the previous leader did not succeed in having
a quorum learn about such a request. Assume that the global order for configuration 2 is delivered
only to processes $p_6$ and $p_7$, but not to the other members. The leader $p_7$ cannot yet give back a
response to $p_6$ for the request $op_1$; indeed if the leader does so it allows process $p_6$ to give an answer
back to its client and an inconsistency may arise as we show next. Assume that configuration 3 is
established. This new configuration has membership set $\{p_1, p_2, p_3, p_4, p_5\}$ and leader $p_1$. No one
in configuration 3 knows that $op_1$ has even been submitted. The global order conveyed by action
$\text{NEWSTATE}(\langle o, d, a \rangle)$ is empty. A new operation $op_2$ comes in, say from process $p_2$; the leader $p_1$ receives
it, puts it into the first position of the global order, submits it through the DLC service, receives
acknowledgments from a quorum and gives an answer back to process $p_2$. We have an inconsistency:
process $p_2$ told its client that $op_2$ is the first operation applied to the shared object while process $p_6$
told its client that operation $op_1$ is the first operation applied to the shared object. Hence, before
giving a response, the leader of a new configuration needs to submit to the DLC service operations
that have not successfully gone to a quorum through the DLC service.
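The membership arithmetic behind this scenario can be checked with a few lines of Python. The sketch assumes, as in the scenario, that quorums are majorities; the helper names `is_majority` and `majorities` are hypothetical.

```python
from itertools import combinations


def is_majority(subset, members):
    """True if subset contains more than half of the members."""
    return 2 * len(subset & members) > len(members)


def majorities(members):
    """Enumerate every majority of the given membership set."""
    ordered = sorted(members)
    need = len(ordered) // 2 + 1
    for size in range(need, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield set(combo)


conf1 = {"p1", "p2", "p3", "p4", "p5", "p6"}
conf2 = conf1 | {"p7"}
conf3 = {"p1", "p2", "p3", "p4", "p5"}

# Processes that have op1 in their global order at each stage of the scenario.
knows_op1_in_conf1 = {"p6"}
knows_op1_in_conf2 = {"p6", "p7"}

# op1 never reached a quorum, and nobody who knows it is a member of configuration 3.
reached_quorum_1 = is_majority(knows_op1_in_conf1, conf1)
reached_quorum_2 = is_majority(knows_op1_in_conf2, conf2)
survivors = knows_op1_in_conf2 & conf3

# Had op1 been resubmitted and acknowledged by a majority of configuration 2,
# at least one member of configuration 3 would know it.
every_quorum_reaches_conf3 = all(Q & conf3 for Q in majorities(conf2))
```

The last check illustrates why resubmitting the operation to the DLC service before answering removes the inconsistency in this example.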

Thus for any $i$ such that $order(i)_p \neq \perp$ and $\textit{req-done}(i) = false$, that is, for any operation in
$order$ for which the leader cannot be sure that a quorum of processes know about that operation,
the flag $\textit{req-sbmttd}(i)$ is set to false so that the operation will be submitted to the DLC service.
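The two adjustments made when the new state is received can be sketched as a small Python function. It is an illustration under assumptions: descriptors are tuples `(origin, type, param, local_rank)`, arrays are dictionaries with missing keys standing for ⊥, and the function name `adjust_flags` is hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Descriptor = Tuple[str, str, Optional[int], int]  # (origin, type, param, local_rank)


def adjust_flags(p: str,
                 local_req: Dict[int, Descriptor],
                 order: Dict[int, Descriptor],
                 req_done: Dict[int, bool]) -> Tuple[Set[int], Set[int]]:
    """Return (resend, resubmit).

    resend:   local ranks j whose req-sent(j) flag is reset to false, because
              the request does not appear in the new global order.
    resubmit: positions i whose req-sbmttd(i) flag is set to false, because
              the operation is in the order but is not known to be done.
    """
    in_order = {(d[0], d[3]) for d in order.values()}
    resend = {j for j in local_req if (p, j) not in in_order}
    resubmit = {i for i in order if not req_done.get(i, False)}
    return resend, resubmit


# Process p6 issued two requests; only the first made it into the new order,
# and it was not yet acknowledged by a quorum.
local_req = {1: ("p6", "write", 1, 1), 2: ("p6", "read", None, 2)}
order = {1: ("p6", "write", 1, 1)}
req_done = {1: False}
resend, resubmit = adjust_flags("p6", local_req, order, req_done)
```

Here the second local request has to be sent again to the new leader, and the leader has to resubmit position 1 to the DLC service.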

Another way to get around the problem mentioned above is to delay the response for an operation
$i$, for which the leader does not know that the operation has been spread to a quorum, until a later
operation $j$, $j > i$, is spread to a quorum in the same configuration. When operation $j$ is known
to a quorum $Q$, operation $i$ is also known to a quorum because each of the processes
in $Q$ has to establish the configuration and thus knows operation $i$. The RAB algorithm adopts the
solution of submitting operation $i$ to the DLC service to make sure a quorum knows it.

7.3.3 Sketch of proof of correctness

In this section we present the key arguments for a correctness proof for the RAB algorithm.

The overall algorithm RAB consists of the composition of DLC and automaton $\text{DLC-TO-RAB}_p$
for each $p \in \mathcal{P}$. We claim that the RAB algorithm implements an atomic shared object. An
atomic shared object is an object that can be accessed concurrently by several processes that issue
invocations (requests) and receive responses for those requests in such a way that it is possible to
insert serialization points that make the responses consistent with all previous (with respect to the
serialization points) events. In the RAB algorithm invocation events are the actions $\text{REQUEST}_p$ and
response events are the actions $\text{CONFIRM}_p$. We refer the reader to Chapter 13 of [65] for a formal
definition of atomic object.

We define the history variable $\textit{build-order}(g, i)_p \in \mathcal{D}_\perp$ for each process $p \in \mathcal{P}$, each configuration
identifier $g \in \mathcal{G}$ and $i \in \mathbb{N}^{>0}$. Such a variable is $\perp$ if process $p$ has not established $g$; otherwise it is
defined as follows: if $current.id_p > g$ then $\textit{build-order}(g, i)_p$ is equal to the value $order(i)_p$ when $p$
left configuration $g$; if $current.id_p = g$ then $\textit{build-order}(g, i)_p = order(i)_p$.

Next we provide key invariants that will be used to prove that RAB implements an atomic
object. Remember that variable $\textit{state-dlv}[c.id]$ contains the set of processes that have established
configuration $c$.

Invariant 7.3.1 In any reachable state the following is true. Let $c$ be a configuration such that
$c.ldr \notin \textit{state-dlv}[c.id]$. Then for any $p, q \in \textit{state-dlv}[c.id]$ and any $i \in \mathbb{N}^{>0}$, we have that
$\textit{build-order}(c.id, i)_p = \textit{build-order}(c.id, i)_q$.

Proof: When a configuration is established at a member $p$, process $p$ executes action $\text{NEWSTATE}(\langle o, d, a \rangle)_p$
and sets $order_p := o$. The tuple $\langle o, d, a \rangle$ is the same for every member of the configuration. So
initially every member has the same value of $order$. Within the configuration a member $p$ updates
$order(i)_p$ to a particular value $m$ only when the leader $r$ executes action $\text{DELIVER}(m, (i,r))_p$; but since
the leader has not established $c$, such an action cannot be executed. □

Invariant 7.3.2 In any reachable state the following is true. Let $c$ be a configuration such that
$c.ldr \in$ state-dlv$[c.id]$. Then for any $p \in$ state-dlv$[c.id]$ and any $i \in \mathbb{N}^{>0}$, we have that if
build-order$(c.id, i)_p \neq \perp$ then build-order$(c.id, i)_p$ = build-order$(c.id, i)_{c.ldr}$.

Proof: When a configuration is established at a member $p$, process $p$ executes action newstate$(\langle o, d, a \rangle)_p$
and sets order$_p := o$. The tuple $\langle o, d, a \rangle$ is the same for every member $p$ of the configuration. So
initially every member has the same value of order$_p$. Within the configuration a member $p$ updates
order$(i)_p$ to a particular value $m$ only when executing action deliver$(m, \langle i, r \rangle)_p$; but in this case we
have that the leader of the configuration is $r$ and order$(i)_r = m$. □

We remark that the knowledge of order may diverge for those processes that remain in obsolete
configurations. For example, if a process $p$ updates order$(i)_p$ to $m$ because it receives such an $m$ from
the leader of a configuration $c_1$, but a new configuration $c_2$ is established before any other process
updates order$(i)$, then the only two processes to have order$(i) = m$ might be $p$ and $c_1.ldr$. Assume
that neither $p$ nor $c_1.ldr$ is a member of $c_2$. Then the leader of $c_2$ can write something different from
$m$ into order$(i)$. The above scenario is possible because $m$ is not known to enough processes.

Given an index $i$, an $m \in \mathcal{D}$ and a configuration $c$ we say that the triple $(i,m,c)$ is good
in a state $s$ if there exists a quorum $Q \in c.qrms$ such that for every process $p \in Q$ we have
$s$.build-order$(c.id, i)_p = m$.

We also say that $(i,m)$ is good if there exists a configuration $c$ such that $(i,m,c)$ is good and
that an index $i$ is good if there exists $m$ such that $(i,m)$ is good.

The definition of a good index admits the possibility that in a given state there exist m and m' 
such that (i,m) and (i,m') are both good; however, as we will see later, this never happens in the 
algorithm. Indeed the notion of good index is intended to capture the fact that an operation m has 
been assigned to the $i$-th position of order. In the following we will prove that operations assigned
to good indexes are propagated to newer configurations. 
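
The definition of a good triple can be made concrete with a small sketch. The Python below is illustrative only: the function names `is_good_triple` and `good_values`, and the representation of build-order for a fixed configuration and index as a dictionary from process to value (with `None` standing for $\perp$), are assumptions and do not appear in the rab code.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Set

Process = str
Op = Hashable


def is_good_triple(quorums: List[FrozenSet[Process]],
                   build_order_at_i: Dict[Process, Optional[Op]],
                   m: Op) -> bool:
    """Return True if some quorum Q has build-order(c.id, i)_p == m for all p in Q.

    `quorums` plays the role of c.qrms; `build_order_at_i` maps each process p
    to build-order(c.id, i)_p, with None modelling the undefined value.
    """
    if m is None:
        return False  # the undefined value is never an assigned operation
    return any(all(build_order_at_i.get(p) == m for p in q) for q in quorums)


def good_values(quorums: List[FrozenSet[Process]],
                build_order_at_i: Dict[Process, Optional[Op]]) -> Set[Op]:
    """Collect every m for which (i, m, c) is good in the given snapshot."""
    candidates = {v for v in build_order_at_i.values() if v is not None}
    return {m for m in candidates if is_good_triple(quorums, build_order_at_i, m)}
```

With majority quorums over three processes, a value known to two of them is good, while a value known to a single process is not; this is the situation described in the remark about obsolete configurations.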

Invariant 7.3.3 In any reachable state the following is true. Suppose that $w \in$ Est and $c \in$ Est
such that $w.id < c.id$, and there are no totally established configurations $x$ with $w.id < x.id < c.id$.
Suppose $(i,m,w)$ is good. Then order$(i)_p = m$ for every $p \in$ state-dlv$[c.id]$.

Proof: We prove the invariant by induction on the length of the execution. The base case is 
vacuously satisfied. 

For the inductive step, let s be a reachable state and assume that the invariant is true in all 
states previous to s. We need to prove that the invariant is true in s. 

Let $w$ and $c$ be configurations satisfying the assumption of the statement. Let $(i,m,w)$ be good
in $s$. Since $(i,m,w)$ is good in $s$, there exists a quorum $Q \in w.qrms$ of processes such that for each
$r \in Q$ we have $s$.build-order$(w.id, i)_r = m$. For the rest of the proof we fix such a $Q$.

We need to prove that any process $p \in s$.state-dlv$[c.id]$ has $s$.order$(i)_p = m$. Processes that
establish $c$ set their order variables to the value computed by the condenser function $\phi_{rabstate}$ for
configuration $c$. Hence we need to look at the inputs that the condenser function $\phi_{rabstate}$ for
configuration $c$ receives from the members of $c$.

Since $c$ is established, all members of $c$ submit their state to the condenser function $\phi_{rabstate}$ for
configuration $c$. Partition $c.set$ into three subsets $S_1$, $S_2$ and $S_3$, as follows: $S_1$ contains the processes
that had high $< w.id$ at the moment they submitted the state to $\phi_{rabstate}$ for configuration $c$; $S_2$
contains the processes that had high $= w.id$ at the moment they submitted the state to $\phi_{rabstate}$ for
configuration $c$; $S_3$ contains the processes that had high $> w.id$ at the moment they submitted the
state to $\phi_{rabstate}$ for configuration $c$.

In the following we provide three claims that will be used to complete the proof.

Claim 1: $S_2 \cup S_3 \neq \emptyset$.

Proof of Claim 1: By Invariant 7.1.1, a quorum $Q' \in w.qrms$ is included in $c.set$.
Let $r$ be a process in $Q \cap Q'$. Such a process exists because $Q$ and $Q'$ are quorums of
$w$. Clearly $r \in c.set$. Since process $r \in Q$ we have that $s$.build-order$(w.id, i)_r = m$. By
definition of build-order we have that process $r$ has established $w$ in state $s$. Process $r$
has to establish $w$ before submitting its state for $c$, because it does not take any action
for $w$ after participating in $c$. Hence at the moment $r$ submits its state for $c$ we have
that high$_r \geq w.id$. Therefore $r \in S_2 \cup S_3$. Thus $S_2 \cup S_3 \neq \emptyset$.

Claim 2: If $q \in S_2 \cup S_3$ then $q$ submits either $m$ or $\perp$ as order$(i)$ to the condenser function $\phi_{rabstate}$
for configuration $c$.

Proof of Claim 2: Fix any $q \in S_2 \cup S_3$. Let $s'$ be the state in which process $q$
submits its state for the condenser function $\phi_{rabstate}$ for configuration $c$. We need to
prove that $s'$.build-order$(w'.id, i)_q = m$, where $w'$ is the configuration that $q$ establishes
before joining $c$. (Configuration $w'$ is equal to $w$ for $q \in S_2$ and is a later one for $q \in S_3$.)

We consider two cases: $q \in S_2$ and $q \in S_3$.

Case 1: $q \in S_2$. In this case $w' = w$ so we need to prove that $s'$.build-order$(w.id, i)_q$ is
either $m$ or $\perp$.

We first notice that state $s'$ precedes state $s$.

We consider two cases: (i) $q \in Q$, and (ii) $q \notin Q$.

Case 1.1: $q \in Q$. Since in $s'$ process $q$ already participates in $c$ and $c.id > w.id$,
we have that after $s'$ process $q$ does not execute any action for configuration $w$
and thus build-order$(w.id, i)_q$ does not change after $s'$. Since $q \in Q$, we have
that $s$.build-order$(w.id, i)_q = m$. Since build-order$(w.id, i)_q$ does not change
after $s'$, we must have $s'$.build-order$(w.id, i)_q = m$.

Case 1.2: $q \notin Q$. We first notice that since $q \in S_2$ we have that
$q \in s'$.state-dlv$[w.id]$. This implies that $q \in s$.state-dlv$[w.id]$.

If $s$.build-order$(w.id, i)_q = \perp$, since process $q$ has already left configuration $w$
by state $s'$, we have that $s'$.build-order$(w.id, i)_q = \perp$, as needed. Hence assume
that $s$.build-order$(w.id, i)_q \neq \perp$.

By Invariant 7.1.1, a quorum $Q'' \in w.qrms$ is included in $c.set$. Since $Q$ and
$Q''$ are quorums of $w$, there exists a process $r \in Q \cap Q''$. Clearly $r \in c.set$.
Since $r \in Q$ we have that $s$.build-order$(w.id, i)_r = m$.

Next we prove that $s$.build-order$(w.id, i)_q = m$.

If $w$ is not established at $w.ldr$ in $s$, by Invariant 7.3.1 we have that
$s$.build-order$(w.id, i)_q$ = $s$.build-order$(w.id, i)_r = m$.

If $w$ is established at $w.ldr$ in $s$, by Invariant 7.3.2 we have that
$s$.build-order$(w.id, i)_{w.ldr}$ = $s$.build-order$(w.id, i)_r = m$. By the same invariant,
since $s$.build-order$(w.id, i)_q \neq \perp$, we have that
$s$.build-order$(w.id, i)_q$ = $s$.build-order$(w.id, i)_{w.ldr} = m$, hence
also in this case $s$.build-order$(w.id, i)_q = m$.

Thus we have that $s$.build-order$(w.id, i)_q = m$.

Since in state $s'$ process $q$ has already left configuration $w$, we have that after $s'$
process $q$ does not change build-order$(w.id, i)_q$. Since $s$.build-order$(w.id, i)_q = m$
we have that $s'$.build-order$(w.id, i)_q = m$, as needed.

Case 2: $q \in S_3$. This process arrives in configuration $c$ from a configuration $w'$ such
that $w.id < w'.id < c.id$. We need to prove that $s'$.build-order$(w'.id, i)_q = m$.

By the inductive hypothesis we have that the statement is true in $s'$. By applying the
inductive hypothesis to state $s'$ with $c = w'$ we have that $s'$.order$(i)_q = m$. By definition
of build-order we have that $s'$.build-order$(w'.id, i)_q = m$, as needed.

Claim 3: At least one process in $S_2 \cup S_3$ submits $m$ as order$(i)$ to the condenser function $\phi_{rabstate}$
for configuration $c$.

Proof of Claim 3: By Invariant 7.1.1, a quorum $Q''' \in w.qrms$ is included in $c.set$.
Since $Q$ and $Q'''$ are quorums of $w$, there exists a process $r \in Q \cap Q'''$. Clearly $r \in c.set$
and also $r \in S_2 \cup S_3$. Since $r \in Q$ we have that $s$.build-order$(w.id, i)_r = m$.

If $r$ arrives to configuration $c$ directly from $w$, then since $s$.build-order$(w.id, i)_r = m$,
process $r$ submits $m$ to the condenser function $\phi_{rabstate}$ for configuration $c$.

Thus consider the case when $r$ arrives to configuration $c$ from $w'$, with $w.id < w'.id < c.id$.
Let $s'$ be the state in which process $r$ submits its state to the $\phi_{rabstate}$ function for
configuration $c$. By applying the inductive hypothesis to state $s'$ with $c = w'$ we have that
$s'$.order$(i)_r = m$. By definition of build-order we have that $s'$.build-order$(w'.id, i)_r = m$.
Hence, also in this case, process $r$ submits $m$ to the condenser function $\phi_{rabstate}$ for
configuration $c$.

We are now ready to conclude the proof. 

By Claim 1, we have that $S_2 \cup S_3$ contains at least one process. Thus, by definition of the
condenser function $\phi_{rabstate}$, we have that the states of processes in $S_1$ are ignored by $\phi_{rabstate}$. So
we only need to worry about what processes in $S_2 \cup S_3$ submit to the state condenser function for
configuration $c$. By Claim 2, we have that processes in $S_2$ and $S_3$ submit either $m$ or $\perp$ as the $i$-th
entry of order to the state condenser function for configuration $c$. By Claim 3, at least one process
in $S_2$ and $S_3$ submits $m$.

Hence $\phi_{rabstate}$ computes an order $o$ for configuration $c$ such that $o(i) = m$.

Every process $p$ that establishes configuration $c$ sets order$_p := o$ when executing action newstate$(\langle o, d, a \rangle)_p$
for configuration $c$. Thus, for such a process, we have that order$(i)_p = m$ when it establishes
configuration $c$. Within configuration $c$, process $p$ modifies order$_p$ only when receiving a message from
the leader. However the leader also has order$(i)_{c.ldr} = m$. So order$_p$ does not change.

Hence we conclude that if $p \in s$.state-dlv$[c.id]$, then $s$.order$(i)_p = m$. □

The next invariant is similar to the previous one, but removes the requirement that there be no 
totally established configurations between w and c. 

Invariant 7.3.4 In any reachable state the following is true. Let $w \in$ Est and $c \in$ Est such that
$w.id < c.id$. Let $(i,m,w)$ be good. Then order$(i)_p = m$ for every $p \in$ state-dlv$[c.id]$.

Proof: Let $s$ be any reachable state and let $w$ and $c$ be as required by the statement. Let $(i,m,w)$
be good in $s$. Let $p \in$ state-dlv$[c.id]$. We need to prove that $s$.order$(i)_p = m$.

Let $x_1, x_2, \ldots, x_k$ be the sequence, in order of configuration identifier, of totally established
configurations between $w$ and $c$. Since $(i,m,w)$ is good, by Invariant 7.3.3 we have that $s$.order$(i)_q = m$
for any process $q$ such that $q \in$ state-dlv$[x_1.id]$. Configuration $x_1$ is totally established, hence for any
$q \in x_1.set$ we have $s$.order$(i)_q = m$. Hence we have $s$.build-order$(x_1.id, i)_q = m$ for each member of
$x_1$.

It follows that $(i,m,x_1)$ is good. Thus by Invariant 7.3.3, used with $w = x_1$, we have that
$s$.order$(i)_q = m$ for any process $q$ such that $q \in$ state-dlv$[x_2.id]$. Configuration $x_2$ is totally
established, hence for any $q \in x_2.set$ we have $s$.order$(i)_q = m$. Hence we have $s$.build-order$(x_2.id, i)_q = m$
for each member of $x_2$.

It follows that $(i,m,x_2)$ is good. We can iterate this reasoning for $x_3, \ldots, x_k$, and obtain that
$s$.order$(i)_q = m$ for any process $q$ such that $q \in$ state-dlv$[c.id]$, as needed. □

Next we show that in a given state no two operations can be good for a particular index i. 

Lemma 7.3.5 In any reachable state, given an index i, there exists at most one operation m such 
that (i, m) is good. 

Proof: Fix any reachable state $s$. By contradiction assume that there exist $m$ and $m'$ such that
$(i,m)$ and $(i,m')$ are good in $s$ and such that $m \neq m'$.

By definition of good we have that there exists at least one configuration $w'$ such that $(i,m,w')$
is good in $s$. Let $w_1$ be the configuration with the smallest identifier among the configurations $w'$
for which $(i,m,w')$ is good. Of course $(i,m,w_1)$ is good in $s$.

Similarly, by definition of good we have that there exists at least one configuration $w'$ such
that $(i,m',w')$ is good in $s$. Let $w_2$ be the configuration with the smallest identifier among the
configurations $w'$ for which $(i,m',w')$ is good. Of course $(i,m',w_2)$ is good in $s$.

Without loss of generality assume that $w_1.id < w_2.id$.

We now distinguish two possible cases.

Case 1: There exists $c \in s$.Est such that $w_2.id < c.id$. Fix $p \in s$.state-dlv$[c.id]$. By
Invariant 7.3.4, applied with $w = w_1$, since $(i,m,w_1)$ is good, we have that $s$.order$(i)_p = m$.

By the same invariant, applied with $w = w_2$, since $(i,m',w_2)$ is good, we have that $s$.order$(i)_p = m'$.
This is a contradiction since $m \neq m'$.

Case 2: There is no $c \in s$.Est such that $w_2.id < c.id$. Since $(i,m',w_2)$ is good in $s$ we have that
there exists a quorum $Q \in w_2.qrms$ such that $s$.build-order$(w_2.id, i)_q = m'$ for all members $q$ of $Q$. Fix
$p \in Q$. We have $s$.build-order$(w_2.id, i)_p = m'$. Since there is no $c \in s$.Est such that $w_2.id < c.id$, we
have that $w_2$ is the latest configuration established by $p$. Hence we have that $s$.order$(i)_p = m'$.

By Invariant 7.3.4, applied with $w = w_1$, we have that $s$.order$(i)_p = m$. This is a contradiction
since $m \neq m'$. □

The following lemma generalizes the previous one by claiming that, even across an entire
execution and not just in a single state, we cannot have two different elements of $\mathcal{D}$ being good at the
same index.

Lemma 7.3.6 Let $\alpha$ be an execution. Let $s$ and $s'$ be two states of $\alpha$ and let $m, m' \in \mathcal{D}$ be such
that $m \neq m'$. Then it cannot be that $(i,m)$ is good in $s$ and $(i,m')$ is good in $s'$.

Proof: By definition of good we have that if $(i,m)$ is good in a state $s$ then $(i,m)$ is good in any
subsequent state $s'$. Then the lemma follows easily by Lemma 7.3.5. □

The following lemma states that an index i for which an answer has been computed is good. 

Lemma 7.3.7 In any reachable state we have that if req-answrd(i) p = true for some process p then 
i is good. 

Proof: Let s be a reachable state and assume that s.req-answrd(i) p = true. 

Variable req-answrd$(i)_p$ is set to true by process $p$ either when providing the answer for
operation $i$ with action p2p-send$(\langle$"ans"$, a \rangle)_{p,q}$ or when establishing a new configuration with action
newstate$(\langle o, d, a \rangle)_p$.

In the former case process $p$ is the leader of some configuration $c$ in which the answer for operation
$i$ is computed. Process $p$ computes such an answer only after informing a quorum $Q \in c.qrms$, by
means of the underlying DLC service, about order$(i)_p$. Hence $i$ is good in $s$.

In the latter case process $p$ sets req-answrd $:= a$. So req-answrd$(i)_p$ is true only if $a(i)$ is
true. But $a(i)$ is true only if the leader $p'$ of some previous configuration has computed an answer
for operation $i$ and thus has executed action p2p-send$(\langle$"ans"$, a \rangle)$ for $i$. Let $s'$ be the state when $p'$
computed the answer for operation $i$. Clearly $i$ is good in $s'$. But once $i$ is good in a state it stays
good in all subsequent states. Hence $i$ is good in $s$. □

In order to prove that the system implements an atomic object we use the following lemma, which 
is a version of Lemma 13.16 of [65] (page 435) that considers general operations instead of specific 
read/write operations. 

Lemma 7.3.8 Let $\beta$ be a (finite or infinite) sequence of actions of an atomic object external
interface. Suppose that $\beta$ is well-formed, and contains no incomplete operations. Let $\Pi$ be the set of all
operations in $\beta$. Suppose that $\prec$ is an irreflexive total ordering of the operations in $\Pi$, satisfying the
following properties:

1. For any operation $A \in \Pi$, there are only finitely many operations $B$ such that $B \prec A$.

2. If the response event for operation $A$ precedes the invocation event for operation $B$ in $\beta$, then
$A \prec B$.

3. The response for any operation $A \in \Pi$ is the result of applying all the operations that precede
$A$, including $A$ itself, in the order $\prec$.

Then $\beta$ satisfies the atomicity property.

Finally we can give the following claim. 

Claim 7.3.9 The system RAB implements an atomic object. 

Proof: By Lemma 13.10 of [65] (page 419) we can restrict our attention to executions with only
complete operations. Fix such an execution $\alpha$. Remember that $\Pi$ is the set of operations of $\alpha$.

In order to show that the system implements an atomic object we need to provide a total order $\prec$
on $\Pi$ that satisfies Lemma 7.3.8. Let us define the order $\prec$ as follows.

Let $A$ be an operation in $\Pi$. By definition of $\Pi$ we have that $A$ gets completed. In order for $A$
to complete there must be a leader that stores $A$ in some position $i$ and computes an answer for $A$
setting req-answrd$(i)$ to true. So there exists a state $s$ of $\alpha$ such that $s$.req-answrd$(i)_p$ = true for
some $p$. By Lemma 7.3.7 we have that $(i, A)$ is good in $s$. Denote by tag$(A)$ the index $i$. This is
well defined because of Lemma 7.3.6. Note that no two operations $A$ and $B$ can get the same tag $i$,
because we would have that both $(i, A)$ and $(i, B)$ are good in some state, contradicting Lemma 7.3.6.

Order all operations in $\Pi$ in order of tag.

Next we prove that $\prec$ satisfies the hypothesis of Lemma 7.3.8.

Let us start with Point 1. Fix $A \in \Pi$. Any operation $B \prec A$ must have tag$(B)$ < tag$(A)$. The
number of such operations is bounded by tag$(A)$.

Now consider Point 2. Fix $A, B \in \Pi$ and assume that the response event for operation $A$ precedes
the invocation event for operation $B$.

Since $A, B \in \Pi$ there exists a state $s$ of $\alpha$ and two indexes $i, j$ such that $(i, A)$ and $(j, B)$ are
good in $s$. By definition of tag we have that $i$ = tag$(A)$ and $j$ = tag$(B)$. Hence we need to prove
that $i < j$.

Let s' be the state when B gets invoked. Since the response event of A precedes the invocation 
event of B, we have that the response event of A precedes s'. 

Since $A$ received a response before state $s'$, we have that there exists a state $s''$ preceding $s'$ such
that $s''$.req-answrd$(i)_p$ = true for some process$^2$ $p$. By the code (see action p2p-send$(\langle$"ans"$, a \rangle)_{p,q}$),
we have that $s''$.req-answrd$(k)_p$ = true for any $k < i$. By Lemma 7.3.7 we have that all
indexes $k < i$ are good in $s''$. Since $s''$ precedes $s'$, indexes $k < i$ are good in $s'$ too.

This implies that for each index $k < i$ there exists an operation $A_k$ such that $(k, A_k)$ is good in
$s'$. Moreover none of these operations $A_k$ can be equal to $B$ because $B$ is invoked in state $s'$ and
thus cannot be good in state $s'$.

Remember that $(j, B)$ is good in $s$. By Lemma 7.3.6 it cannot be that $j < i$ because for any
index $k < i$ there exists an operation $A_k \neq B$ such that $(k, A_k)$ is good in $s'$.

Hence it must be that $i < j$, as needed to prove Point 2.

Finally consider Point 3. This condition is true because responses are given in order of tag and
by Lemma 7.3.6 this order is consistent for all operations in $\Pi$. □
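
The ordering by tag used in this proof can be illustrated with a short sketch. The Python below is not the rab code: the function `replay_in_tag_order`, the toy counter object and its `("add", n)` and `("read",)` operations are hypothetical names chosen for illustration. It shows the sense of Point 3, namely that the response to each operation is the result of applying all operations up to and including it, in tag order.

```python
from typing import Any, Callable, Dict, Tuple

State = Any
Op = Any
Response = Any


def replay_in_tag_order(initial: State,
                        ops_by_tag: Dict[int, Op],
                        apply_op: Callable[[State, Op], Tuple[State, Response]]
                        ) -> Dict[int, Response]:
    """Apply operations in increasing tag order and record each response."""
    state = initial
    responses: Dict[int, Response] = {}
    for tag in sorted(ops_by_tag):
        state, response = apply_op(state, ops_by_tag[tag])
        responses[tag] = response
    return responses


def counter_apply(state: int, op: Tuple) -> Tuple[int, int]:
    """Toy sequential object (an assumption for this example): a counter."""
    if op[0] == "add":
        new_state = state + op[1]
        return new_state, new_state
    if op[0] == "read":
        return state, state
    raise ValueError(f"unknown operation {op!r}")
```

For example, `replay_in_tag_order(0, {2: ("add", 5), 1: ("add", 2), 3: ("read",)}, counter_apply)` applies the operation with tag 1 first, regardless of the order in which the dictionary was written.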

7.4 Remarks 

The rab algorithm implements an atomic shared object. As a particular case we may have a read- 
write register. In Chapter 6 the algorithm abd also implements an atomic shared read-write register. 
The latter is built upon the DC service while the former is built upon the dlc service which is a 
variation of the DC service that defines a leader within each configuration. The two algorithms are 
similar (indeed they use similar services as building blocks): both rely on spreading each operation
to a quorum of processes in order to keep data consistency. The main difference is that the rab 
algorithm uses the leader of the current configuration in order to centralize the handling of the 
requests from the clients; within each configuration, the leader of the configuration is responsible for 
providing answers to the requests. With such an approach only the leader needs to have the most up 
to date information and thus, when the system is stable, this approach is more efficient. In the abd 
algorithm at any time there is a quorum of processes which have the most up to date information. 



2 This process p is the leader of the configuration in which the answer to operation A is computed. However this
is not important for this proof.

Like the Liskov-Oki algorithm [76], the rab algorithm uses a centralized approach where a
distinguished process is responsible for performing requested operations; however, such a process needs the
cooperation of a quorum of other processes in order to provide answers to the requested operations.
Our algorithm is dynamic and allows changes in the universe of processes, while the Liskov-Oki
algorithm assumes a fixed universe of processes. The rab algorithm uses a more conservative
approach in providing answers to requested operations: the leader does not respond to a requested
operation until it knows that a quorum of processes have recorded that operation. In the Liskov-Oki
approach the leader immediately responds to requested operations; this is more efficient when there
are no failures, but in case of failures it is less efficient (roll back might be necessary).
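
The conservative response rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the rab automaton: the class `ConservativeLeader`, its method names, and the use of explicit acknowledgements are hypothetical (in rab the leader learns that a quorum knows an operation through the DLC service).

```python
from typing import Dict, FrozenSet, List, Set


class ConservativeLeader:
    """Leader that answers an operation only once a quorum has recorded it."""

    def __init__(self, quorums: List[FrozenSet[str]]) -> None:
        self.quorums = quorums               # quorums of the current configuration
        self.order: Dict[int, str] = {}      # index -> operation
        self.acks: Dict[int, Set[str]] = {}  # index -> processes known to have recorded it

    def assign(self, op: str) -> int:
        """Store the operation at the next free index and return that index."""
        index = len(self.order) + 1
        self.order[index] = op
        self.acks[index] = set()
        return index

    def record_ack(self, index: int, process: str) -> None:
        """Note that `process` has recorded the operation at `index`."""
        self.acks[index].add(process)

    def may_respond(self, index: int) -> bool:
        """True once every member of at least one quorum has acknowledged."""
        return any(q <= self.acks[index] for q in self.quorums)
```

A leader following the immediate-response approach would instead answer right after `assign`, which saves a round of communication when there are no failures but may require undoing the answer later.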

The multipaxos algorithm [61] can also be used to implement a replicated atomic object. Indeed 
processors can agree on the sequence of operations to perform on the shared object by running a 
sequence of instances of a consensus algorithm. The usefulness of developing the rab algorithm is 
that we use a building block which provides a powerful service and thus much of the computation 
that needs to be done is delegated to the dlc building block. This results in a simpler algorithm. 
The overall algorithm is similar to an algorithm that would use the multipaxos approach; however 
the philosophy underlying the building blocks approach is that building blocks are built once and 
then can be used by many applications which can take advantage of powerful properties offered by 
the building block. With such a perspective designing rab is easier than designing a replicated 
atomic object based on multipaxos (the interested reader can compare the code of multipaxos 
provided in [29] with the code of rab).

Chapter 8



Conclusions 



In this thesis we have provided a set of group communication specifications. We have also given 
implementations of the specifications and we have constructed applications on top of the specifica- 
tions. 

The main theme has been that of providing "dynamic" group communication specifications, 
that is, specifications for group communication services that adapt well to dynamic changes of the 
underlying distributed system. This is crucial in systems where processes can join or leave the 
system routinely because of process or link failures. 

In such settings, it is possible that the underlying system suffers partitions. In the presence 
of partitions two approaches can be followed: one is to allow every component of the partition to 
proceed independently; another one is to select a unique "primary" component of the partition and 
allow progress only in that component. The former approach improves availability at the expense 
of shared data coherence. The latter is to be used when replicated data needs to be maintained 
coherently. Most group communication services and specifications take the first approach: they are 
partitionable. 

When applications require a primary component but run over a partitionable group commu- 
nication service, it is the responsibility of the application to figure out whether it is in a primary 
component or not. Establishing whether the current component is primary or not is clearly indepen- 
dent of the particular application. Thus it would be better to move this problem from the application 
to a lower level layer. One possibility is to use a primary component group communication service 
as building block. 

In Chapter 5 we have considered the extension of existing partitionable group communication 
services to primary ones. We have provided a specification for a dynamic primary view group 
communication service called dvs. The communication tools provided by dvs are those typical of a 
group communication service; the membership service provides the client with primary views.

We have also shown that the dvs service is implementable. Our implementation is based on the vs
service of Fekete, Lynch and Shvartsman [41] and uses ideas from the dynamic membership algorithm
of Yeger Lotem, Keidar and Dolev [89]. The implementation filters the views provided by the vs
service in order to establish whether the system has partitioned and, in such a case, to report to the
clients only views satisfying particular intersection properties with previous views. Such views are
the primary components of the partition. By reporting to the clients only these primary views, the
service enforces that computation proceeds only in the primary component.
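
The filtering idea can be sketched in a few lines. The Python below is a deliberately simplified illustration: it assumes that a view is reported as primary when it contains a strict majority of the most recently reported primary view. This majority rule and the name `filter_primary_views` are assumptions made for the example; the intersection properties actually used by the dvs implementation are those given in Chapter 5.

```python
from typing import FrozenSet, List


def filter_primary_views(initial: FrozenSet[str],
                         views: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Return the views that would be reported to clients as primary.

    Simplifying assumption: a view is primary if it contains a strict
    majority of the members of the last primary view.
    """
    last_primary = initial
    reported: List[FrozenSet[str]] = []
    for view in views:
        if 2 * len(view & last_primary) > len(last_primary):
            reported.append(view)
            last_primary = view
    return reported
```

Starting from the view {a, b, c, d, e}, a partition into {a, b} and {c, d, e} leads this sketch to report only {c, d, e}, so at most one component makes progress.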

In order to show the usefulness of the dvs specification we have developed an application on top 
of it. The application we have developed provides a totally ordered broadcast service: clients are 
allowed to broadcast messages to all other members of the system and the service guarantees that
each member of the system receives the messages in the same order. This is a very powerful service
for developing replicated data algorithms or any other application that requires data coherence.

In Chapter 6 we tackled the problem of extending dynamic primary view services to dynamic 
"configuration" services. A configuration is a view with a quorum system defined on the membership 
set of the view. The use of quorums is a well-known technique to improve availability and efficiency 
in a distributed system. With quorums usually a client request is serviced by a quorum of the set 
of all the members of the system (as opposed to the whole set of members). Our goal has been that 
of integrating the use of quorums in a group communication system. In particular we extended the 
dvs service to handle configurations. The result has been a specification for a primary configuration 
group communication service, called DC. The main difficulty in developing DC has been that of 
defining a dynamic primary configuration. The notion of dynamic primary view has been well 
studied (e.g., [55, 89]). As far as we know, there was no corresponding notion for configurations. 
We have developed such a notion and used it to specify the DC service. 
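
As an illustration of what a configuration carries, the sketch below models a view together with a quorum system on its membership set. The class `Configuration`, the helper `majority_configuration`, and the choice of a single family of pairwise-intersecting quorums are assumptions made for this example; the precise definition used by DC is the one given in Chapter 6.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Configuration:
    """A view (identifier and membership set) plus a quorum system on the members."""
    id: int
    members: FrozenSet[str]
    quorums: FrozenSet[FrozenSet[str]]

    def is_well_formed(self) -> bool:
        """Quorums are non-empty subsets of the members and intersect pairwise."""
        if not self.quorums:
            return False
        within = all(q and q <= self.members for q in self.quorums)
        intersecting = all(q1 & q2 for q1, q2 in combinations(self.quorums, 2))
        return within and intersecting


def majority_configuration(id: int, members: Iterable[str]) -> Configuration:
    """Build a configuration whose quorums are the minimal majorities."""
    member_set = frozenset(members)
    need = len(member_set) // 2 + 1
    quorums = frozenset(frozenset(c) for c in combinations(sorted(member_set), need))
    return Configuration(id, member_set, quorums)
```

Majorities are only one possible quorum system; any family of pairwise-intersecting subsets would pass the `is_well_formed` check in this sketch.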

As for the dvs service, we have proved that DC is implementable and useful. The implementation is 
very similar to that of DVS. It uses a static service internally, which provides any new configurations. 
Then it filters these configurations to find those that satisfy certain intersection properties. These 
configurations are the primary configurations which are given to the clients of the service. 
The application we have developed on top of DC is a multi-writer/multi-reader atomic register. 
This application is based on the work of Attiya, Bar-Noy and Dolev [12] and that of Lynch and 
Shvartsman [66]. The algorithm exploits the quorum-oriented framework provided by the DC service. 

Finally in Chapter 7 we have explored the use of the techniques deployed in the development 
of DVS and DC to the design of dynamic algorithms, i.e., algorithms that work well in dynamic 
distributed settings. 
Lamport's paxos algorithm uses quorums to solve the consensus problem; however it is designed for
a static setting. We have used the DC service$^1$ in order to design a dynamic version of the paxos
algorithm, a version that adapts well to system changes, even permanent ones.
We have also provided a dynamic primary copy data replication algorithm. Like the dpaxos
algorithm, this algorithm uses (a variant of) the DC service as a building block. We have sketched
the proof of correctness of this algorithm (the formal proof is left as future work).

Applications developed on top of powerful building blocks are easier to build than those built 
from scratch, because such applications can benefit from the guarantees provided by the building 
blocks. We have shown that our group communication building blocks are powerful enough to 
build interesting applications. We think that other applications can be built on top of the group 
communication services (or variations of them) provided in this thesis. 

An interesting feature of the DC specification is that it integrates a state exchange mechanism 
within the service. When a new configuration is delivered to the client, the client is supposed to 
submit its current state to the service. Once the service receives the state from all the members of 
the configuration, it computes a new up-to-date state and delivers this state to each member of the 
configuration. In this way the state transfer is relegated to the DC service. 
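
The state exchange step can be sketched as follows. The Python below is illustrative only: the class `StateExchange`, its `submit` method and the pluggable `condenser` callable are hypothetical names, and the example condenser that picks the largest submitted value is an assumption, not the function used by any application in this thesis.

```python
from typing import Any, Callable, Dict, FrozenSet, Optional


class StateExchange:
    """Collect one state per member, then compute and hand out a common state."""

    def __init__(self,
                 members: FrozenSet[str],
                 condenser: Callable[[Dict[str, Any]], Any]) -> None:
        self.members = members
        self.condenser = condenser
        self.submitted: Dict[str, Any] = {}

    def submit(self, process: str, state: Any) -> Optional[Dict[str, Any]]:
        """Record a member's state.

        Returns None until all members have submitted; after the last
        submission, returns the new state delivered to each member.
        """
        if process not in self.members:
            raise ValueError(f"{process} is not a member of this configuration")
        self.submitted[process] = state
        if set(self.submitted) != set(self.members):
            return None
        new_state = self.condenser(dict(self.submitted))
        return {p: new_state for p in self.members}
```

Note that nothing is delivered until the last member has submitted its state, which matches the requirement that all members of the configuration submit before the new state is computed.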

The DC service requires all members of the configuration to submit the state before computing a new 
up-to-date state. It would be interesting to explore the possibility of computing the new up-to-date 
state when only a quorum of the processes have submitted their state. Clearly the resulting service 
would be weaker, but it is possible that useful applications can still be constructed on top of this 
weaker service. The advantage would be a more available service. 

One of the major goals of the current research in this area is to provide simple, universally accepted
specifications that describe the semantics of the existing group communication services already in
use in real-world applications. Probably it is not possible to give a unique specification good for all
applications: different applications will require different group communication services, which will
be tailored to the applications. Another approach consists of defining independent protocol layers
that implement different service levels and semantics (as is done, for example, in Horus and
Ensemble). The application developer can use any combination of these layers, building the right
semantics for the needed group communication service.

Though much work has been focused on providing specifications for group communication services
(we refer the reader to Chapter 2 for pointers to the literature or to [87] for a survey), the overall goal
has not been achieved yet. It would be good to provide a set of universally accepted specifications
for group communication services that cover all possible needs of applications. System implementors
could then concentrate on efficient implementations of such specifications and application developers
could build their applications on top of the guarantees provided by the specifications.

1 We actually have used a variant of the DC service tailored to the particular application that we have developed. See
Chapter 7 for more details.

In this perspective, the contribution of this thesis is that of having provided formal specifications 
for two particular group communication services tailored to applications that run in dynamic systems 
and that require primary views and configurations. 

Possible future work that follows the direction of this thesis may include the following. 

One could provide performance and fault-tolerance analysis of the algorithms presented in the
thesis. We have focused our attention on the safety properties; though our algorithms are not
naive$^2$, we have not proved any performance property. Such a performance analysis could be based
on assumptions on the underlying physical distributed system (as is done in [41]). Moreover, since
we were concerned only with safety, our algorithms are not tuned for optimal performance.

Hence one could optimize the algorithms presented in the thesis and compare them with other
algorithms. In particular it would be interesting to provide a dynamic version of the MULTIPAXOS
algorithm built upon the DC group communication service. In Chapter 7 we have provided a
dynamic version of the PAXOS algorithm but not a dynamic version of the MULTIPAXOS algorithm.
One could provide such a dynamic version of MULTIPAXOS and compare its performance with that of
the original MULTIPAXOS algorithm (see [61, 29]). We have provided two algorithms that implement
atomic objects: the abd-sys algorithm, which is based on a decentralized approach, and the rab
algorithm, which is based on a centralized approach. We have not tuned the code of these algorithms
for efficiency. It would be interesting to provide optimized code and compare the performance of
these algorithms with others that solve similar problems (e.g., the Liskov-Oki algorithm [76]).

Another possibility is to provide variations of the group communication services presented in 
this thesis. An interesting one is that of weakening the DC specification presented in Chapter 6 in 
order to allow the computation of the starting state for a new configuration as soon as the processes 
in a quorum of the configuration have submitted their state (the version we have used requires all 
the members of the configuration to submit their state). We believe that this weaker version is still
powerful enough to build useful applications. 

More generally it would be interesting to build other algorithms on top of the group communica- 
tion service building blocks provided in this thesis and also to provide new building blocks tailored 
to other applications. 



2 An algorithm that does nothing is a safe algorithm.

Chapter 9

Distributed k-set Consensus:
Overview



The problem of reaching consensus in a distributed system arises in many forms and various contexts, 
such as, for example, distributed data replication, distributed databases, flight control systems. Data 
replication is used in practice to provide high availability: having more than one copy of the data 
allows easier access to the data, i.e., the nearest copy of the data can be used. However, consistency 
among the copies must be maintained. A consensus algorithm can be used to maintain consistency. 
Another practical example of the use of data replication is an airline reservation system. The data 
consists of the current booking information for the flights and it can be replicated at agencies spread 
over the world. The current booking information can be accessed at any of the replicas. Reservations 
or cancellations must be agreed upon by all the copies. 

Distributed consensus has been extensively studied; a good survey of early results is provided in 
[42]. We refer the reader to [65] for a more up-to-date treatment of consensus problems. 

One of the most celebrated results about distributed consensus is the impossibility result of
Fischer, Lynch and Paterson [43]. This impossibility result, popularly known as FLP, states that
it is impossible to achieve distributed consensus in asynchronous systems even if only one stop
failure is possible. This surprising result sparked various directions of research aimed at solving the
problem by either restricting the asynchrony of the computation model [31, 35], or using randomized
protocols [18, 21, 80], or weakening the problem definition [24, 32, 39, 40].

The last of these three directions of research falls in the more general research area of demarcating
what is deterministically computable and what is deterministically impossible in asynchronous
distributed systems in the presence of failures. The FLP impossibility seemed to suggest that no
nontrivial problem could be solved deterministically and asynchronously in the presence of faults.
Attiya, Bar-Noy, Dolev, Peleg and Reischuk [13] showed that the renaming problem can be solved in
a deterministic way in asynchronous systems in the presence of failures. Informally, in the renaming
problem processors start the computation with a "name" taken from some unbounded ordered name
space and have to "rename" themselves with names chosen from a new small name space. This result
revived the research trend of exploring what is computable and what is impossible in deterministic
asynchronous distributed systems subject to failures. Following this direction, Chaudhuri [24] defined
the k-set consensus (or k-consensus for short) problem, which is a natural generalization of the consensus
problem obtained by allowing processes to decide on up to k different values, instead of requiring them to
agree on a single value. The 1-consensus problem is the classical consensus problem.

Chaudhuri provided an algorithm to solve the k-consensus problem that tolerates up to a threshold
t of process failures strictly smaller than k. This result proved that the k-consensus problem,
for $k \ge 2$, allows more resilience than the 1-consensus problem. Chaudhuri conjectured that the
k-consensus problem was impossible to solve while tolerating k or more failures. This conjecture was
proven true by three independent research teams: Borowsky and Gafni [20], Herlihy and Shavit [52]
and Saks and Zaharoglou [84]. Attiya [11] provided an alternative proof of the same result.

The results of [24, 20, 52, 84] completely characterize the k-consensus problem in asynchronous
systems with stop failures. In such a model the k-consensus problem is solvable if and only if $t < k$.

The formal definition of the k-consensus problem requires three conditions to be satisfied: agreement,
termination and validity. The agreement condition requires that each process decide on a
value in such a way that the set of decided values has cardinality at most k. The termination
condition simply requires that each (correct) process decide. As for the validity condition,
several variants have been considered in the literature. The validity condition used in [24, 20, 52, 84]
requires that each decision be equal to some input value.

An alternative definition of the validity condition considered for the 1-consensus problem with
stop failures requires that if all the inputs to the processes of the system are equal then any decision
must be equal to the input (see, for example, [65, Ch. 6]). This condition is the one considered for
the k-consensus problem.

In a Byzantine environment faulty processes can "mask" their inputs. Hence a more suitable 
validity condition considered for the 1-consensus problem with Byzantine failures requires that if all 
the correct processes have the same input then any decision be the input of a correct process [62, 78]. 

In this thesis we explore the solvability of the k-set consensus problem in asynchronous message
passing models in which processes fail by stopping or fail arbitrarily (Byzantine failures). The main
theme is that the validity condition has a profound impact on when the problem is solvable. We
consider six different validity conditions and use these conditions to demarcate when k-set consensus
is solvable for each system model. In several cases we completely characterize solvability. In some
we characterize solvability with very little uncertainty (i.e., a small gap between computable and
impossible) and in a few cases we leave a substantial gap.

In more detail, we start from the validity condition used by Chaudhuri, which we call the
"regular" validity condition (the decision of any correct process is the input of some process) and
denote by RV1, and consider a weakened version (if there are no failures then every decision is the
input of some process), denoted by WV1, and a strengthened version (the decision of any correct
process is the input of some correct process), denoted by SV1. For each of these three validity
conditions we consider a corresponding weakened version obtained by requiring the condition only
if all the processes start with the same input value. We denote these validity conditions by SV2
(if all correct processes start with v then correct processes decide v), RV2 (if all processes start with
v then correct processes decide v) and WV2 (if there are no failures and all processes start with v,
then any decision is equal to v).

For the crash failures model we completely characterize the line that separates possible from
impossible for each of the above six validity conditions, with the exception of validity SV2 where a
tiny gap is left open.

For the Byzantine failures model we characterize the line that separates possible from
impossible, leaving a tiny gap for SV2, RV2 and WV2, and a substantial gap for WV1.

The rest of this part is structured as follows. In Chapter 10 we give the formal definition of
the k-consensus problem. We study the k-set consensus problem for crash failures in Chapter 11.
Chapter 12 presents the results for Byzantine failures.

Chapter 10

The problem



We consider a distributed system consisting of n processes denoted by $p_1, p_2, \ldots, p_n$. A process that
follows its algorithmic specification throughout an execution is said to be correct, and a process
that departs from its specification is said to be faulty. In a fail-stop model (also known as a crash
model), faulty processes are only allowed to halt execution prematurely. In a Byzantine model, a
faulty process can deviate from its specification arbitrarily. We assume that at most t processes fail,
where $t \ge 1$ is a known integer.

We assume that the system is asynchronous. Processes communicate by sending messages over 
a network. We are not concerned with the particular topology of the network. Since we consider 
asynchronous systems, messages can take arbitrarily long time to reach their destination. We only 
assume that the network of processes is connected, that is, any process can send a message to any 
other process. Moreover, messages are not altered, lost or duplicated while in transit on the network. 

For any k, $1 \le k \le n$, we denote a k-set consensus problem by KC(k) or simply KC when k is
not relevant. The KC(k) problem is defined as follows. Each process $p_i$ starts the computation with
an input value $v_i$. Each correct process has to irreversibly "decide" on a value in such a way that
three conditions, called termination, agreement and validity, hold. These conditions are:

Termination: Every correct process eventually decides. 

Agreement: The set of values decided by correct processes has size at most k. 

Validity: One of the following conditions.

SV1 (strong v1): The decision of any correct process is equal to the input of some correct process.
SV2 (strong v2): If all correct processes start with v then correct processes decide v.
RV1 (regular v1): The decision of any correct process is equal to the input of some process.
RV2 (regular v2): If all processes start with v then correct processes decide v.
WV1 (weak v1): If there are no failures, then the decision of any process is equal to the input of some process.
WV2 (weak v2): If there are no failures and all processes start with v, then the decision of any process is equal to v.

Given a validity condition C, we denote by KC(k,C) the KC(k) problem defined with validity C.
We also use the notation KC(C) if k is not relevant. We use the notation KC(k,t) to denote a KC(k)
consensus problem with at most t failures allowed. The notation KC(k,t,C) denotes KC(k,t) with
validity C.
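The three conditions can be made concrete by checking them on the outcome of a single finished run. The following Python sketch is only an illustration and is not part of the thesis: the representation of a run (a map of inputs, a map of decisions, and the set of correct processes) and the function names are assumptions introduced here.

```python
from typing import Dict, Hashable, Set

def termination(decisions: Dict[int, Hashable], correct: Set[int]) -> bool:
    """Termination: every correct process has decided."""
    return all(p in decisions for p in correct)

def agreement(decisions: Dict[int, Hashable], correct: Set[int], k: int) -> bool:
    """Agreement: correct processes decide on at most k distinct values."""
    return len({decisions[p] for p in correct}) <= k

def validity(cond: str, inputs: Dict[int, Hashable],
             decisions: Dict[int, Hashable], correct: Set[int]) -> bool:
    """Check one of the six validity conditions on a terminated run."""
    decided = [decisions[p] for p in correct]
    all_inputs = set(inputs.values())
    correct_inputs = {inputs[p] for p in correct}
    no_failures = correct == set(inputs)
    if cond == "SV1":
        return all(d in correct_inputs for d in decided)
    if cond == "RV1":
        return all(d in all_inputs for d in decided)
    if cond == "WV1":
        return (not no_failures) or all(d in all_inputs for d in decided)
    if cond == "SV2":
        # Only constrains runs where all correct processes share one input.
        return len(correct_inputs) != 1 or all(d in correct_inputs for d in decided)
    if cond == "RV2":
        return len(all_inputs) != 1 or all(d in all_inputs for d in decided)
    if cond == "WV2":
        return ((not no_failures) or len(all_inputs) != 1
                or all(d in all_inputs for d in decided))
    raise ValueError(f"unknown validity condition: {cond}")

# Hypothetical run: process 3 is faulty, and process 1 decides 3's input.
inputs = {1: "a", 2: "b", 3: "c"}
decisions = {1: "c", 2: "a"}
correct = {1, 2}
print(validity("RV1", inputs, decisions, correct))  # "c" is some process' input
print(validity("SV1", inputs, decisions, correct))  # "c" is not a correct process' input
```

In this hypothetical run RV1 holds while SV1 fails, which is exactly the difference between the regular and the strong condition.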

We define a partial order on the KC problems based on the strength of the validity conditions.
We say that KC(C) is weaker than KC(D) if any algorithm for solving KC(D) can be used to solve
KC(C) in a given model. Clearly, if KC(C) is weaker than KC(D), then any impossibility result that holds
for KC(C) holds also for KC(D). Conversely, we say that KC(C) is stronger than KC(D) if KC(D) is
weaker than KC(C). Figure 10-1 shows the "weaker than" relation between the validity conditions.

Figure 10-1: Validity conditions. An arrow from a validity condition C to a validity condition D means
that KC(C) is weaker than KC(D) (and that KC(D) is stronger than KC(C)).

KC(k,RV1) is the consensus problem as considered by Chaudhuri [24]. KC(1,RV1) and KC(1,RV2)
are classical consensus problems (see, e.g., [65, Ch. 6]). KC(1,SV2) has been considered in the
Byzantine setting [62, 78]. KC(1,WV2) is weak Byzantine agreement [60].

It is well known that the case $k = 1$ cannot be solved for any nontrivial validity condition and,
in particular, for any of the validity conditions that we consider here, for any $t \ge 1$ [43]. On the
other hand, if $k = n$ then KC(k) is trivially solvable (each process decides its own value), even in
the Byzantine setting, for any t and with the strongest validity condition we are considering, that
is, validity SV1. Thus, we will henceforth be concerned only with the cases $2 \le k \le n - 1$. Since the
problem is easily solvable for $t = 0$ we also assume that $t \ge 1$.

Chapter 11

Crash failures



In this section we consider the crash failures model. In Section 11.1 we recall known results. In 
Sections 11.2 and 11.3, we provide further impossibility results and protocols, respectively. Figure 11- 
1 shows a graphical representation of the results provided in this section. 

11.1 Known results 

As noted in Chapter 9, for the crash failure model we already know the line between computable
and impossible for KC(k,t,RV1):

Lemma 11.1.1 ([24]) In the crash model, there is a protocol for KC(k,t,RV1), for $t < k$.

Lemma 11.1.2 ([20, 52, 84]) In the crash model, there is no protocol for KC(k,t,RV1), for $t \ge k$.

By Lemma 11.1.1, we have that KC(k,t,RV2), KC(k,t,WV1) and KC(k,t,WV2) are solvable for
$t < k$ because RV2, WV1 and WV2 are weaker than RV1. By Lemma 11.1.2, KC(k,t,SV1) cannot be
solved for $t \ge k$ because SV1 is stronger than RV1.

11.2 Impossibilities 

In this section we provide impossibility results for the crash model. An ingredient in most of our
impossibility results is the fact that in any protocol tolerating t failures, a process must be able
to decide after communicating with at most $n - t$ processes (including itself). Indeed, if a process
waited to communicate with more than $n - t$ processes, termination could not be achieved: the runs
in which exactly t faulty processes do not send any messages would not terminate.

Lemma 11.2.1 In the crash model, there is no protocol for KC(k,t,WV2), for $t \ge \frac{(k-1)n+1}{k}$.

Figure 11-1: Crash model. Regions filled in brick pattern indicate impossibility. Regions filled in honeycomb
pattern indicate solvability. Unfilled regions indicate open problems. Figures are drawn to scale n = 64.

Proof: For a contradiction, assume that such a protocol A exists. In the rest of the proof we use
the notation $KC_P(k,t,C)$ to explicitly state the set P of processes among which k-set consensus is
to be solved. Denoting by $\mathcal{P}$ the set of all processes, we have that A solves $KC_{\mathcal{P}}(k,t,\mathrm{WV2})$.

Since $t \ge ((k-1)n+1)/k$ implies $n \ge k(n-t)+1$, we can partition the n processes into k
disjoint groups $g_1, g_2, \ldots, g_k$ with $g_1, \ldots, g_{k-1}$ containing exactly $n - t$ processes and $g_k$
containing at least $n - t + 1$ processes. If $t = n$ we let $g_1, g_2, \ldots, g_{k-1}$ be singleton sets of processes
and we let $g_k$ contain at least two processes (this is possible because we only consider $k < n$).

First we claim that there is a run of A where only processes in $g_k$ take steps and such that two
values are decided. To see why, assume that all the runs involving only processes of $g_k$ are such
that only one value is decided. Then we could use A to solve $KC_{g_k}(1,1,\mathrm{WV2})$: $g_k$ contains at least
$n - t + 1$ processes, so that even if one of them is faulty we still have at least $n - t$ correct processes
in $g_k$ and hence the protocol has to terminate. However, this contradicts [43], since no such protocol
exists. Hence there is a run $\alpha_k$ in which only processes in $g_k$ take steps and they decide on at least
two different values, say $v_k, v_{k+1}$. Let $v_1, \ldots, v_{k-1}$ be $k - 1$ values different from $v_k, v_{k+1}$.

Fix $i$, $i \in \{1, 2, \ldots, k-1\}$, and consider the following run $\alpha_i$: all processes are correct, all start
with $v_i$ and all messages sent to processes in $g_j$, $j = 1, 2, \ldots, k$, by processes not in $g_j$ are delayed
until all processes in $g_j$ make a decision. We can use A to solve $KC_{\mathcal{P}}(k,t,\mathrm{WV2})$ and by validity WV2
we have that all processes, in particular those in group $g_i$, decide $v_i$.

Now consider the following run $\alpha$. All processes are correct; for each $i$, $i = 1, 2, \ldots, k-1$, each
process in $g_i$ starts with $v_i$, and processes in $g_k$ start with the same values they start with in $\alpha_k$. Moreover
for each $i$, $i = 1, 2, \ldots, k$, all messages sent to processes in group $g_i$ by processes not in $g_i$ are delayed
until all processes in $g_i$ have decided. We can use A to solve $KC_{\mathcal{P}}(k,t,\mathrm{WV2})$ in $\alpha$. However, for each
$i$, $i = 1, 2, \ldots, k$, processes in $g_i$ cannot distinguish between run $\alpha_i$ and run $\alpha$. Indeed in both runs
they only communicate with processes in $g_i$ before making a decision and in both runs processes in
$g_i$ start with the same value. Since, for $i = 1, 2, \ldots, k-1$, in run $\alpha_i$ processes in $g_i$ decide $v_i$, they
must decide $v_i$ also in $\alpha$. Since in run $\alpha_k$ processes in $g_k$ decide on $v_k$ and $v_{k+1}$, they must decide $v_k$
and $v_{k+1}$ also in $\alpha$. Hence we have that $k + 1$ values are decided in $\alpha$. Thus the agreement condition
is violated and this contradicts the hypothesis that A solves $KC_{\mathcal{P}}(k,t,\mathrm{WV2})$. □
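The counting step that opens the proof of Lemma 11.2.1 (the bound $t \ge ((k-1)n+1)/k$ leaves room for $k-1$ groups of exactly $n-t$ processes plus one larger group) can be checked with a few lines of Python. This is a sketch under stated assumptions, not the thesis's construction: the function name `wv2_partition_sizes` is hypothetical, and it returns only group sizes, not the groups themselves.

```python
from typing import List, Optional

def wv2_partition_sizes(n: int, k: int, t: int) -> Optional[List[int]]:
    """Sizes of the k groups used in the proof, or None if the bound fails.

    The condition t >= ((k-1)n + 1)/k is tested in integer arithmetic as
    k*t >= (k-1)*n + 1, which is the same as n >= k*(n-t) + 1.
    """
    if k * t < (k - 1) * n + 1:
        return None
    if t == n:
        # Special case from the proof: singletons plus one group of >= 2 processes.
        return [1] * (k - 1) + [n - (k - 1)]
    small = n - t
    # The last group receives everything that is left over.
    return [small] * (k - 1) + [n - (k - 1) * small]

print(wv2_partition_sizes(64, 4, 49))  # three groups of n-t = 15 and one larger group
print(wv2_partition_sizes(64, 4, 48))  # bound not met
```

For n = 64 and k = 4 the last group has 19 processes when t = 49, which is at least n - t + 1 = 16 as the proof requires; for t = 48 the bound is not met and no such partition is produced.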

Lemma 11.2.2 In the crash model, there is no protocol for KC(k,t,WV1), for $t \ge k$.

Proof: For a contradiction assume that there exists such a protocol A. We claim that A can be
used to solve KC(k,t,RV1) for $t \ge k$. To see why, consider any run $\alpha$ in which $f \le t$ processes are
faulty and let $g$ be the set of correct processes and $g'$ be the set of faulty processes. Now consider
a run $\alpha'$ that is identical to $\alpha$ except that all processes are correct and any message sent by any
$p \in g'$ in $\alpha'$ after the time that $p$ failed in $\alpha$ is delayed until after all processes in $g$ decide. That
is, for each $p_i \in g$ and each $p_j \in g'$, $p_i$ receives a message from $p_j$ at time T in $\alpha'$ iff $p_i$ receives the
same message at time T from $p_j$ in $\alpha$. By the validity condition WV1, each process decides on some
process' input in $\alpha'$. Clearly, processes in $g$ cannot distinguish between $\alpha$ and $\alpha'$. Hence, processes
in $g$ decide the same value in $\alpha$ as they decide in $\alpha'$, and so validity RV1 is satisfied in $\alpha$. In other
words, protocol A solves KC(k,t,RV1) for $t \ge k$, contradicting Lemma 11.1.2. □

Lemma 11.2.3 In the crash model, there is no protocol for KC(k,t,SV1).

Proof: For a contradiction assume that there exists such a protocol A. Let $\alpha$ be an execution of A
in which all processes are correct and they all start with different values. Let $v$ be a decision made by
at least two processes (there is always such a decision since $k < n$). Because of validity SV1, $v$ is the
input of some process $p_i$, and since all inputs are different only $p_i$ has $v$ as input. Now consider the
run $\alpha'$ that is the same as $\alpha$ except that process $p_i$ fails right after sending its last message. Clearly
$\alpha$ and $\alpha'$ are indistinguishable and thus each process (maybe with the exception of $p_i$) makes the
same decision in both runs. Hence in $\alpha'$ value $v$ is decided by at least one process $p_j$, $j \ne i$. But
only $p_i$ has $v$ as input and $p_i$ is not correct in $\alpha'$, and so validity SV1 is violated. □

Lemma 11.2.4 In the crash model, there is no protocol for KC(k,t,SV2), for $t \ge \frac{k}{2k+1} n$.

Proof: For a contradiction assume that there exists such a protocol A. Consider first the case
$t \ge n/2$. Partition the system into two non-intersecting sets of processes, $g$, $g'$, each containing at
least $n - t$ processes (e.g., $|g| = |g'| = n/2$). This is always possible because $t \ge n/2$. Let $\alpha$ be a
run of A in which all processes are correct, all start with different initial values denoted $v_1, v_2, \ldots, v_n$,
and all communication between $g$ and $g'$ is delayed until after the decisions are made. We claim
that n values are decided in $\alpha$. To see this, fix any process $p_i \in g$, and consider the following run
$\alpha_i$. The processes in $g$ start with the same values as in $\alpha$, and all except $p_i$ crash after $p_i$ reaches a
decision. All the processes in $g'$ start with $v_i$ but communication between $g$ and $g'$ is delayed until
after $p_i$ makes a decision. By SV2, $p_i$ must decide $v_i$ in $\alpha_i$, and by indistinguishability of $\alpha$ from $\alpha_i$,
$p_i$ must decide $v_i$ in $\alpha$. Similarly, runs $\alpha_i$ can be constructed for every process $p_i \in g'$, and hence
all processes must decide their own values in $\alpha$. This contradicts the hypothesis that A solves the
problem (for $k < n$).

Now consider the case $t < n/2$. In this case, $n - 2t > 0$ and the condition $t \ge \frac{k}{2k+1} n$ is equivalent
to $k \le \frac{n-t}{n-2t} - 1$. Let $g$ be a subset of the system containing $n - t$ processes, and let
$g_1, \ldots, g_{\lfloor (n-t)/(n-2t) \rfloor}$ be a partition of $g$ into disjoint sets of size at least $n - 2t$ each. Let $\alpha$ be a run of A in which all the
processes are correct, communication between $g$ and the rest of the system is delayed until after all
processes have decided and, for each $i$, processes in $g_i$ start with a distinct value $v_i$. Fix $i$, and let
$p_i \in g_i$ be some process. Consider a run $\alpha_i$ of A as follows: Processes in $g_i$ are correct, all processes
in $g \setminus g_i$ are faulty, and crash after $p_i$ decides. All communication between $g$ and the rest of the
system is delayed until after $p_i$ decides. By SV2, $p_i$ must decide $v_i$ in $\alpha_i$, and since $\alpha$ is indistinguishable
to $p_i$ from $\alpha_i$, $p_i$ must decide $v_i$ in $\alpha$. Therefore, in $\alpha$, at least $\lfloor \frac{n-t}{n-2t} \rfloor$ different values are decided
on. This contradicts the hypothesis that A solves the problem since $k \le \frac{n-t}{n-2t} - 1 < \lfloor \frac{n-t}{n-2t} \rfloor$. □
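The equivalence used in the second case of the proof of Lemma 11.2.4 is stated without derivation. A short worked version, assuming only $n - 2t > 0$ so that multiplying by $n - 2t$ preserves the direction of the inequality, is:

```latex
\begin{align*}
k \le \frac{n-t}{n-2t} - 1
  &\iff (k+1)(n-2t) \le n - t \\
  &\iff (k+1)n - 2(k+1)t \le n - t \\
  &\iff kn \le (2k+1)\,t \\
  &\iff t \ge \frac{k}{2k+1}\,n .
\end{align*}
```

The final strict inequality in the proof follows because $x - 1 < \lfloor x \rfloor$ for every real $x$, applied here with $x = \frac{n-t}{n-2t}$.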

11.3 Protocols 

In this section we provide two protocols for the crash model. 

PROTOCOL A: Each process broadcasts its input and waits for $n - t$ messages. If all $n - t$
messages contain the same value $v$, then the process decides $v$, else it decides a default
value $v_0$.

Lemma 11.3.1 PROTOCOL A solves KC(k,t,RV2) in the crash model for $t < \frac{k-1}{k} n$.

Proof: We start by proving termination. The number of actual failures is less than or equal to t. Hence
there are at least $n - t$ correct processes. Thus each correct process eventually receives at least $n - t$
messages and is able to make a decision.

Now we prove agreement. For the sake of contradiction assume that $k + 1$ values are decided.
One of them could be the default value, but at least k values, different from the default value, are
decided. By the protocol it is necessary that there be k disjoint sets $g_1, g_2, \ldots, g_k$, each consisting of
at least $n - t$ processes such that each process in $g_i$ sends a value $v_i$ (with $v_i \ne v_j$ for $i \ne j$). Hence
there must be at least $k(n - t)$ processes. However since $t < \frac{k-1}{k} n$ we have that $n - t > n/k$ and
that $k(n - t) > n$, which implies that there must be more than n processes. This is impossible since
we have n processes.

Finally we prove validity. Assume that all processes start with value $v$. Clearly a process cannot
receive two different values since $v$ is the only value being sent. Hence by the protocol each process
that makes a decision decides $v$. □
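The decision rule of PROTOCOL A, and the counting argument behind agreement, can be illustrated with a small Python sketch. It is not the thesis's code: the names `protocol_a_decide` and `possible_decisions` and the default value `"v0"` are assumptions made here, and instead of simulating message delivery the sketch enumerates every set of $n - t$ senders a process could hear from, which over-approximates the decisions that can occur in a run. It assumes $t < n$.

```python
from itertools import combinations
from typing import Hashable, Sequence, Set

DEFAULT = "v0"  # stand-in for the default value v_0

def protocol_a_decide(received: Sequence[Hashable], default: Hashable = DEFAULT) -> Hashable:
    """Decide v if all n-t received messages carry v, else the default."""
    first = received[0]
    return first if all(v == first for v in received) else default

def possible_decisions(inputs: Sequence[Hashable], t: int,
                       default: Hashable = DEFAULT) -> Set[Hashable]:
    """Every decision reachable from some set of n-t senders.

    Any value actually decided in a run belongs to this set, so its size
    is an upper bound on the number of distinct decisions.
    """
    n = len(inputs)
    return {protocol_a_decide([inputs[i] for i in senders], default)
            for senders in combinations(range(n), n - t)}

# n = 6, t = 2, k = 2 satisfies t < (k-1)n/k = 3.
print(possible_decisions([1, 1, 1, 1, 2, 2], t=2))
# With identical inputs only that input can be decided (validity RV2).
print(possible_decisions([7, 7, 7, 7, 7], t=2))
```

In the first example only the value 1 is held by at least $n - t = 4$ processes, so at most two values (1 and the default) can be decided, in line with the agreement bound for k = 2.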

PROTOCOL B: Each process broadcasts its input and waits for $n - t$ messages. One of
these $n - t$ messages is the process' own message. If $n - 2t$ messages contain the same
value as its own, say $v$, the process decides $v$, else it decides a default value $v_0$.

Lemma 11.3.2 PROTOCOL B solves KC(k,t,SV2) in the crash model for $t < \frac{k-1}{2k} n$.

Proof: We start by proving termination. The number of actual failures is less than or equal to t. Hence
there are at least $n - t$ correct processes. Thus each correct process eventually receives at least $n - t$
messages and is able to make a decision.

Now we prove agreement. For the sake of contradiction assume that $k + 1$ values are decided.
One of them could be the default value, but at least k values, different from the default value, are
decided. By the protocol it is necessary that there be k disjoint sets $g_1, g_2, \ldots, g_k$, each consisting of
at least $n - 2t$ processes such that each process in $g_i$ sends a value $v_i$ (with $v_i \ne v_j$ for $i \ne j$). Hence
there must be at least $k(n - 2t)$ processes. However since $t < \frac{k-1}{2k} n$ we have that $k(n - 2t) > n$, which
implies that there must be more than n processes. This is impossible since we have n processes.

Finally we prove validity. Assume that all correct processes start with value $v$. We have to prove
that a correct process decides $v$. Let $p$ be a correct process. First we observe that since $p$ starts
with $v$ it decides $v$ or $v_0$. Hence it suffices to prove that $p$ receives at least $n - 2t$ messages with
$v$. Among the $n - t$ messages $p$ receives, at least $n - 2t$ are from correct processes. Hence process $p$
receives at least $n - 2t$ messages with $v$. □
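The decision rule of PROTOCOL B differs from that of PROTOCOL A only in its threshold. The sketch below is an illustration, not the thesis's code: the function name and the default value `"v0"` are hypothetical, and it assumes that the process' own message counts toward the $n - 2t$ matching messages and that "$n - 2t$ messages" means at least $n - 2t$.

```python
from typing import Hashable, Sequence

def protocol_b_decide(own: Hashable, received: Sequence[Hashable],
                      n: int, t: int, default: Hashable = "v0") -> Hashable:
    """Decide own input if at least n-2t of the n-t received messages match it.

    `received` is assumed to contain exactly n-t values, one of which is
    the process' own message.
    """
    if len(received) != n - t:
        raise ValueError("a process decides after exactly n - t messages")
    matching = sum(1 for v in received if v == own)
    return own if matching >= n - 2 * t else default

# n = 7, t = 2: a process waits for 5 messages and needs 3 matching its input.
print(protocol_b_decide(1, [1, 1, 1, 2, 3], n=7, t=2))
print(protocol_b_decide(1, [1, 1, 2, 2, 3], n=7, t=2))
```

The first call has three messages equal to the process' own input and so decides that input; the second has only two and falls back to the default. This mirrors the validity argument: if all correct processes start with $v$, at most t of the $n - t$ received messages can carry another value, so at least $n - 2t$ carry $v$.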



11.4 Remarks 

For KC(RV2) and KC(WV2), there is a very tiny gap between our possibility and impossibility results
(Lemmas 11.2.1 and 11.3.1), formed by the cases where n is a multiple of k. These are isolated points
on the line that separates possible from impossible. Since for all other points on this line the problem
is not solvable, it would be very surprising if for those isolated points the problem were solvable. For
KC(SV2) there is also a small gap between our possibility and impossibility results (Lemmas 11.2.4
and 11.3.2).
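The crash-model results of this chapter can be collected into a single lookup, which also makes the gaps mentioned above easy to enumerate. The Python sketch below is an illustration built only from the bounds in Lemmas 11.1.1 to 11.3.2; the function name `crash_status` and the three labels are assumptions made here, and all comparisons are done in integer arithmetic to avoid rounding.

```python
def crash_status(cond: str, n: int, k: int, t: int) -> str:
    """Classify KC(k,t,cond) in the crash model as solvable, impossible or open."""
    if not (2 <= k <= n - 1 and 1 <= t <= n):
        raise ValueError("expects 2 <= k <= n-1 and 1 <= t <= n")
    if cond == "SV1":
        return "impossible"                                 # Lemma 11.2.3
    if cond in ("RV1", "WV1"):
        return "solvable" if t < k else "impossible"        # Lemmas 11.1.1, 11.1.2, 11.2.2
    if cond in ("RV2", "WV2"):
        if k * t < (k - 1) * n:                             # t < (k-1)n/k, Lemma 11.3.1
            return "solvable"
        if k * t >= (k - 1) * n + 1:                        # t >= ((k-1)n+1)/k, Lemma 11.2.1
            return "impossible"
        return "open"
    if cond == "SV2":
        if 2 * k * t < (k - 1) * n:                         # t < (k-1)n/(2k), Lemma 11.3.2
            return "solvable"
        if (2 * k + 1) * t >= k * n:                        # t >= kn/(2k+1), Lemma 11.2.4
            return "impossible"
        return "open"
    raise ValueError(f"unknown validity condition: {cond}")

n = 64
for k in (3, 4):
    gap = [t for t in range(1, n + 1) if crash_status("WV2", n, k, t) == "open"]
    print(k, gap)   # open points appear only when k divides (k-1)n
print([t for t in range(1, n + 1) if crash_status("SV2", n, 4, t) == "open"])
```

For n = 64 the WV2 gap is empty when k = 3 and is the single point t = 48 when k = 4, matching the remark that the open cases arise when n is a multiple of k.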
Chapter 12



Byzantine failures 



In this section we consider the Byzantine failures model. In Section 12.1 we are concerned with 
impossibilities and in Section 12.2 we provide protocols. Figure 12-1 shows a graphical representation 
of the results. 

12.1 Impossibilities 

In this section we provide impossibility results for the Byzantine model. Clearly the impossibilities
proved for the crash model still hold. In particular the impossibilities for KC(SV1) and KC(WV1)
are directly derived from the corresponding ones for the crash model. Next we provide additional
impossibilities.

Lemma 12.1.1 In the Byzantine model, there is no protocol that solves KC(k,t,WV2), for $t \ge \frac{k}{2k+1} n$
and $t \ge k$.



Proof: For a contradiction assume that such a protocol A exists. We distinguish two cases: (i)
$t \ge n/2$ and (ii) $t < n/2$.

Consider case (i). Let $v_1, v_2, \ldots, v_{t+1}$ be $t + 1$ different values. Let $\alpha$ be the following run of A.
The number of actual failures in $\alpha$ is $f = n - t - 1$. Let F be the set of faulty processes and let
$p_1, \ldots, p_{t+1}$ be the correct processes. Process $p_i$ has input $v_i$, for $i = 1, 2, \ldots, t+1$. Messages between
any two correct processes are delayed until all correct processes decide, that is, correct processes
communicate only with processes in F.

We now show that at least $k + 1$ values are decided in $\alpha$, which contradicts the hypothesis that A
solves the problem. For each $i = 1, 2, \ldots, t+1$ consider the following run $\alpha_i$. All processes are correct,
all have input $v_i$, messages between processes not belonging to F are delayed until all processes not
in F decide. By validity WV2, we have that in $\alpha_i$ all processes must decide $v_i$. Process $p_i$, for
$i = 1, 2, \ldots, t+1$, cannot distinguish between $\alpha$ and $\alpha_i$, if in $\alpha$ the members of F behave as if they
were correct and had $v_i$ initially. Hence $p_i$ has to decide the same value in both runs. We have
that process $p_i$ decides $v_i$ also in $\alpha$. Since $v_1, v_2, \ldots, v_{t+1}$ are different, we have that $t + 1$ values are
decided in $\alpha$. But $t \ge k$, hence at least $k + 1$ values are decided in $\alpha$.

Figure 12-1: Byzantine model. Regions filled in brick pattern indicate impossibility. Regions filled in
honeycomb pattern indicate solvability. Unfilled regions indicate open problems. Figures are drawn to scale
n = 64.

Consider case (ii). Since $t < n/2$ we have that $n - 2t > 0$ and thus the condition $t \ge \frac{k}{2k+1} n$
is equivalent to $\frac{n-t}{n-2t} \ge k + 1$. Then, we can partition the processes into $k + 2$ groups, the first
$k + 1$ of which, denoted $g_1, g_2, \ldots, g_{k+1}$, each consist of at least $n - 2t$ processes, and the last of
which, denoted F, consists of t processes. Let $\alpha$ be the following run of A. Let $v_1, v_2, \ldots, v_{k+1}$ be
$k + 1$ different values. Processes in $g_i$ start with $v_i$, for $i = 1, 2, \ldots, k+1$, and processes in F are
faulty. Processes in group $g_i$ communicate only within $g_i$ and with processes in F. For each group
$g_i$ processes in F behave as correct processes with input $v_i$.

We now show that at least $k + 1$ values are decided in $\alpha$, which contradicts the hypothesis that A
solves the problem. For each $i = 1, 2, \ldots, k+1$ consider the following run $\alpha_i$. All processes are correct,
all have input $v_i$, processes in group $g_i$ communicate only within $g_i$ and with processes in F. By
validity WV2, we have that in $\alpha_i$ all processes must decide $v_i$. Processes in $g_i$, for $i = 1, 2, \ldots, k+1$,
cannot distinguish between $\alpha$ and $\alpha_i$. Hence they have to decide the same value in both runs, and
so processes in $g_i$ decide $v_i$ also in $\alpha$. Since $v_1, v_2, \ldots, v_{k+1}$ are different, we have that $k + 1$ values
are decided in $\alpha$. □

Lemma 12.1.2 In the Byzantine model, there is no protocol that solves KC(k,t,RV1).

Proof: For a contradiction assume that such a protocol A exists. Let $\alpha_1$ be a run of A in which all
processes are correct and each starts with a different input value. Let $v_1, \ldots, v_z$ be the set of values
decided by correct processes. Because A satisfies validity RV1, each of the $v_i$ is the input of some
process. Since $z \le k < n$, we have that there exists a value $v_i$, $1 \le i \le z$, decided by at least two
processes, say $p_1$ and $p_2$.

Let process $q$ be the process whose input in $\alpha_1$ is $v_i$ for some $i \in \{1, \ldots, z\}$. Use A in the run $\alpha_2$
in which $q$ is faulty but behaves as in $\alpha_1$, claiming that $v_i$ is its input, but it has $v'_i$ as its input,
with $v'_i$ different from $v_i$ and also from any other input. Since correct processes cannot distinguish
between $\alpha_1$ and $\alpha_2$ they have to decide on the same value. We now distinguish three possible cases:
(1) $q$ is different from both $p_1$ and $p_2$, (2) $q$ is $p_1$ and (3) $q$ is $p_2$. If $q$ is different from both $p_1$ and $p_2$
then both $p_1$ and $p_2$ are correct and thus they decide on $v_i$ in $\alpha_2$. However $v_i$ is not an input value
in $\alpha_2$. Hence validity is violated. If $q$ is $p_1$ (resp. $p_2$) then $p_2$ (resp. $p_1$) is correct and thus decides
$v_i$ in $\alpha_2$. However $v_i$ is not an input value in $\alpha_2$. Hence validity RV1 is violated. This contradicts
the hypothesis that A solves KC(k,t,RV1). □

Lemma 12.1.3 In the Byzantine model, there is no protocol for $\mathcal{KC}(k,t,\mathrm{RV2})$, for $t \ge \frac{k}{2(k+1)} n$.

Proof: The proof is similar to that for Lemma 11.2.4. For a contradiction assume that such a
protocol $A$ exists. We distinguish two cases: (i) $t < n/2$ and (ii) $t \ge n/2$. Consider case (i). Since
$t < n/2$ we have that $n - 2t > 0$ and thus the condition $t \ge \frac{k}{2(k+1)} n$ is equivalent to $\frac{n}{n-2t} \ge k+1$.
Then, we can partition the processes in $k + 1$ groups each consisting of at least $n - 2t$ processes.
Consider case (ii). In this case we partition the processes in $k + 1$ groups each consisting of at least
one process.

In both cases, let $g_1, g_2, \ldots, g_k, g_{k+1}$ be the $k + 1$ groups of processes. Let $v_1, \ldots, v_{k+1}$ be $k + 1$
different values and consider the following run $\alpha$. All processes are correct, and processes in group $g_i$
start with $v_i$. For each group $g_i$, there is a set of $t$ processes not belonging to $g_i$, call it $F_i$, such
that, for each $i$, communication is allowed only among processes in $g_i$ and $F_i$ until all processes have
decided. Notice that the cardinality of $g_i \cup F_i$ is at least $n - t$ in both cases.

We now show that $k + 1$ values are decided in $\alpha$, which contradicts the hypothesis that $A$ solves
the problem. Fix $i$, $1 \le i \le k + 1$, and consider run $\alpha_i$. There are exactly $t$ faulty processes and these
processes are those in $F_i$. Processes in $g_i$ are correct. All processes start with $v_i$. Faulty processes
behave exactly as they do in run $\alpha$. Processes in $g_i$ communicate only with other processes in $g_i$ and
$F_i$. We can use $A$ to solve $\mathcal{KC}(k,t,\mathrm{RV2})$, and by the validity RV2 we have that all correct processes,
and in particular those in $g_i$, decide $v_i$. Processes in $g_i$ cannot distinguish run $\alpha$ and run $\alpha_i$. Hence,
since they decide $v_i$ in $\alpha_i$ they have to decide $v_i$ also in $\alpha$. It follows that $k + 1$ values are decided
in $\alpha$. □
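The equivalence used in case (i) of the proof above can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names `hypothesis`, `enough_processes` and `mismatches` are hypothetical, and the lemma's condition is read as $t \ge \frac{k}{2(k+1)} n$.

```python
from fractions import Fraction


def hypothesis(n, t, k):
    """Condition t >= k*n / (2*(k+1)), evaluated with exact rationals."""
    return Fraction(t) >= Fraction(k * n, 2 * (k + 1))


def enough_processes(n, t, k):
    """True when k+1 disjoint groups of n-2t processes fit among n processes."""
    return (k + 1) * (n - 2 * t) <= n


def mismatches(max_n=40):
    """Count triples (n, t, k) with t < n/2 on which the two conditions disagree."""
    bad = 0
    for n in range(2, max_n + 1):
        for t in range(n):
            if 2 * t >= n:  # only case (i), t < n/2, is examined here
                break
            for k in range(1, n):
                if hypothesis(n, t, k) != enough_processes(n, t, k):
                    bad += 1
    return bad
```

Calling `mismatches()` scans all small parameter triples; a result of zero means the two formulations agree on every triple examined.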

12.2 Protocols 

In this section we provide protocols for the Byzantine model. We start by observing that
PROTOCOL A, used for the crash model, solves $\mathcal{KC}(\mathrm{WV2})$ also in the Byzantine model, though only for a
restricted range of values of $k$ and $t$.

Lemma 12.2.1 PROTOCOL A solves $\mathcal{KC}(k,t,\mathrm{WV2})$ in the Byzantine model for $t < n/2$ and $k \ge \frac{n-t}{n-2t} + 1$.



Proof: We start by proving termination. Since there are at most $t$ failures, correct processes are
guaranteed to receive at least $n - t$ messages and thus they decide.

Next we prove agreement. To have a bound on the number of possible decisions we look at
how many values different from the default value can be decided. Let $f$ be the number of actual
failures. We have that any group of $n - t - f$ correct processes that start with the same value can
be forced by the $f$ faulty processes to decide that value. Notice that since $f \le t < n/2$ we have
that $n - t - f \ge 1$. Hence the number of decisions can be as big as the number of possible disjoint
groups of $n - t - f$ correct processes, plus one to take into account the default value. There can
be at most $(n - f)/(n - t - f)$ such groups. This function is an increasing function of $f$ and thus
it achieves its maximum value for $f = t$. Hence the number of different decisions we can have is at
most $(n - t)/(n - 2t) + 1$. Since $k \ge (n - t)/(n - 2t) + 1$ agreement is satisfied.

Finally we prove validity. Assume that all processes are correct and start with v. Then clearly 
v is the only decision. □ 
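The monotonicity claim in the agreement argument follows from $(n-f)/(n-t-f) = 1 + t/(n-t-f)$, which grows as $f$ grows. As a small illustrative sketch (the names `groups_bound` and `max_decisions` are hypothetical, and the bound is taken without rounding, as in the proof text), the bound can be evaluated exactly:

```python
from fractions import Fraction


def groups_bound(n, t, f):
    """Bound (n - f)/(n - t - f) on disjoint groups of n-t-f correct processes."""
    return Fraction(n - f, n - t - f)


def max_decisions(n, t):
    """Largest bound over 0 <= f <= t, plus one for the default value."""
    if 2 * t >= n:
        raise ValueError("this bound assumes t < n/2")
    return max(groups_bound(n, t, f) for f in range(t + 1)) + 1
```

For example, with $n = 10$ and $t = 3$ the maximum is reached at $f = t$ and equals $7/4 + 1$.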

Lemma 12.2.2 PROTOCOL A solves $\mathcal{KC}(k,t,\mathrm{WV2})$ in the Byzantine model for $t \ge n/2$ and $k \ge t+1$.

Proof: Termination and validity are as in the previous lemma. Next we prove agreement. Let $f$ be
the number of actual failures. We distinguish two cases: (i) $f \le n - t - 2$ and (ii) $f > n - t - 2$.
In case (i) we have that for any $n - t$ messages received by a process, at least two of them are sent
by correct processes. Hence for each different value $v \ne v_0$ decided by some process at least two
correct processes have sent that value. Hence no more than $n/2$ values different from the default
value $v_0$ can be decided. Hence at most $n/2 + 1$ different values can be decided in case (i). In case
(ii) the number of correct processes is strictly less than $t + 2$. Hence we cannot have more than $t+1$
different decisions. Putting together the two cases, we have that the number of different decisions
is at most $\max\{n/2 + 1, t + 1\} = t + 1 \le k$. □

Next we provide a generalized version of the "echo" protocol of Bracha and Toueg [22], which we
call $\ell$-echo, where $\ell \ge 2$. (The 1-echo protocol is Bracha and Toueg's echo protocol.) The $\ell$-echo
protocols will be used to provide a family of protocols for $\mathcal{KC}(\mathrm{SV2})$.

$\ell$-echo protocol: To $\ell$-echo broadcast a message $m$, the sender $s$ sends the message
(init, $s$, $m$) to all other processes. When a process $p$ receives the first (init, $s$, $m$) from $s$,
it sends the message (echo, $s$, $m$) to all other processes. Subsequent init messages from
$s$ are ignored. If process $p$ receives message (echo, $s$, $m$) from more than $(n + \ell t)/(\ell + 1)$
processes, then process $p$ accepts message $m$ as sent by the sender process $s$.
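The receiving side of the $\ell$-echo protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's own code: the class `EchoProcess` and its method names are hypothetical, message transport is left to the caller, and whether a process counts its own echo is decided by whoever feeds `on_echo`.

```python
from fractions import Fraction


class EchoProcess:
    """Receiving side of an l-echo broadcast (hypothetical sketch)."""

    def __init__(self, pid, n, t, ell):
        self.pid = pid
        # Acceptance needs strictly more than (n + ell*t)/(ell + 1) echos.
        self.threshold = Fraction(n + ell * t, ell + 1)
        self.echoed = {}      # sender -> message this process echoed
        self.echo_from = {}   # (sender, message) -> set of echoing pids
        self.accepted = set()

    def on_init(self, sender, message):
        """Return the echo to broadcast, or None for a repeated init."""
        if sender in self.echoed:
            return None  # subsequent init messages from sender are ignored
        self.echoed[sender] = message
        return ("echo", sender, message)

    def on_echo(self, from_pid, sender, message):
        """Record one echo; return True once (sender, message) is accepted."""
        key = (sender, message)
        self.echo_from.setdefault(key, set()).add(from_pid)
        if len(self.echo_from[key]) > self.threshold:
            self.accepted.add(key)
        return key in self.accepted
```

Echos are stored per echoing process, so a faulty process that repeats the same echo is counted once. With $n = 7$, $t = 2$ and $\ell = 1$ the threshold is $9/2$, so five distinct echos are needed.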

Lemma 12.2.3 In a system with $t < \ell n/(2\ell + 1)$, if a sender $s$ uses the $\ell$-echo protocol to send a
message $m$ then:

(i) Correct processes accept at most $\ell$ different messages.

(ii) If $s$ is correct, every correct process accepts $m$.

Proof: First we prove (i). For the sake of contradiction assume that correct processes accept $\ell + 1$
different messages $m_1, m_2, \ldots, m_{\ell+1}$. Then there must be $\ell + 1$ correct processes, say $p_1, p_2, \ldots, p_{\ell+1}$,
such that process $p_i$ receives more than $(n + \ell t)/(\ell + 1)$ echos with $m_i$, for each $i = 1, 2, \ldots, \ell + 1$. Thus
there must be a total of more than $n + \ell t$ echos sent for the messages $m_1, m_2, \ldots, m_{\ell+1}$. Let $f$ be the
actual number of faulty processes. Since a faulty process can send $\ell + 1$ different echos (it can echo $m_1$
to $p_1$, $m_2$ to $p_2$ and so on) we have that strictly more than $n + \ell t - (\ell + 1)f \ge n + \ell f - (\ell + 1)f = n - f$
echos are sent by correct processes. This implies that at least one correct process sent two different
echos, which is not possible.

Now we prove (ii). If the sender is correct, then it sends an init message for $m$ to all other
processes. Any correct process will receive this and broadcast an echo message for $m$. Because there
are at most $t < (n + \ell t)/(\ell + 1)$ faulty processes, no correct process accepts any message other than
$m$. Since there are at least $n - t$ correct processes, it is sufficient that $n - t$ be strictly greater than
$(n + \ell t)/(\ell + 1)$ in order to guarantee that any correct process receives enough echo messages to be
able to accept $m$. Since $t < \ell n/(2\ell + 1)$ we have that $n - t > (n + \ell t)/(\ell + 1)$. □
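The last step of the proof can be spelled out. Assuming only that $\ell \ge 1$, so that multiplying by $\ell + 1$ preserves the inequality, the two conditions are equivalent:

```latex
\begin{align*}
n - t > \frac{n + \ell t}{\ell + 1}
  &\iff (\ell + 1)(n - t) > n + \ell t \\
  &\iff \ell n > (2\ell + 1)\, t \\
  &\iff t < \frac{\ell n}{2\ell + 1}.
\end{align*}
```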

The $\ell$-echo protocol is used to define a family of protocols for $\mathcal{KC}(k,t,\mathrm{SV2})$ as follows.

PROTOCOL C($\ell$): Each process broadcasts its input using the $\ell$-echo protocol and waits
for $n - t$ messages to be accepted, where one of these $n - t$ messages is the process' own
message. If $n - 2t$ messages contain the same value $v$, then the process decides $v$, else it
decides a default value $v_0$.
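The decision step of PROTOCOL C($\ell$) can be sketched as below. This is a hypothetical illustration: the function `decide` and the sentinel `DEFAULT` (standing in for $v_0$) are not from the thesis, "$n - 2t$ messages" is read as "at least $n - 2t$", ties between values are broken by first arrival, and the requirement that the process' own message is among the accepted ones is not modeled.

```python
from collections import Counter

DEFAULT = object()  # hypothetical stand-in for the default value v_0


def decide(accepted_values, n, t, default=DEFAULT):
    """Decide from the first n - t values accepted via l-echo broadcast."""
    if len(accepted_values) < n - t:
        raise ValueError("a process decides only after accepting n - t messages")
    counts = Counter(accepted_values[: n - t])
    value, count = counts.most_common(1)[0]
    # Decide the value only if at least n - 2t accepted messages carry it.
    return value if count >= n - 2 * t else default
```

For instance, with $n = 7$ and $t = 2$ a process waits for five accepted messages and needs three of them to agree before deciding a non-default value.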

Lemma 12.2.4 PROTOCOL C($\ell$) solves $\mathcal{KC}(k,t,\mathrm{SV2})$ in the Byzantine model for $t < \frac{k-1}{2k+\ell-1} n$ and $t < \frac{\ell}{2\ell+1} n$.

Proof: We start by proving termination. Since there are at least $n - t$ correct processes, each correct
process eventually accepts at least $n - t$ messages broadcast by $\ell$-echo and is able to make a decision.

Now we prove agreement. For a contradiction assume that $k + 1$ values are decided. One of them
could be the default value, but at least $k$ values, different from the default value, are decided. By the
protocol it is necessary that there be $k$ sets $g_1, g_2, \ldots, g_k$, each consisting of at least $n - 2t$ processes,
such that some correct process accepts a value $v_i$ from each process in $g_i$ (with $v_i \ne v_j$ for $i \ne j$).
Hence correct processes accept at least $k(n - 2t)$ values broadcast by $\ell$-echo. Each faulty process can
contribute $\ell$ different values, and so the number of different senders is at least $k(n - 2t) - (\ell - 1)t$.
However since $t < \frac{k-1}{2k+\ell-1} n$ we have that $k > \frac{n + (\ell-1)t}{n - 2t}$ and thus $k(n - 2t) - (\ell - 1)t > n$, which
implies that there must be more than $n$ processes, a contradiction.

Finally we prove validity. Assume that all correct processes start with value $v$. We have to prove
that a correct process decides $v$. Let $p$ be a correct process. First we observe that since $p$ starts with
$v$ it either decides $v$ or $v_0$. Hence it suffices to prove that $p$ receives at least $n - 2t$ messages with
$v$. Among the $n - t$ messages $p$ receives at least $n - 2t$ are from correct processes. Hence process $p$
receives at least $n - 2t$ messages with $v$. □
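The arithmetic behind the agreement step of this proof can be made explicit. Assuming $n - 2t > 0$, the bound on $t$ is equivalent to the inequality that yields the contradiction:

```latex
\begin{align*}
k(n - 2t) - (\ell - 1)t > n
  &\iff (k - 1)\, n > (2k + \ell - 1)\, t \\
  &\iff t < \frac{k - 1}{2k + \ell - 1}\, n ,
\end{align*}
\qquad\text{and, dividing the first inequality by } n - 2t > 0,\qquad
k > \frac{n + (\ell - 1)t}{n - 2t}.
```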

Finally we provide a protocol for $\mathcal{KC}(\mathrm{WV1})$.

PROTOCOL D: Processes $p_1, p_2, \ldots, p_{t+1}$ each broadcasts its input value. A process that
receives a value $v_i$ from $p_i$, $i \in \{1, 2, \ldots, t + 1\}$, broadcasts an (echo, $v_i$, $p_i$) message and
never echos a value for $p_i$ again. Processes $p_1, p_2, \ldots, p_k$ each decides on its own value.
Every other process decides the first value $v_i$, $i \in \{1, \ldots, t + 1\}$, for which it receives
identical echos (echo, $v_i$, $p_i$) from $n - t$ processes.

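In PROTOCOL D, we say that a process accepts a value $v_i$ from $p_i$ if it receives identical echos for $v_i$
from at least $n - t$ processes. We define the following functions

$$Y(n,t,f) = \begin{cases} n - f & \text{if } n - t - f \le 0 \\ t + 1 - f + f \left\lfloor \frac{n-f}{n-t-f} \right\rfloor & \text{if } n - t - f > 0 \end{cases}$$

and

$$Z(n,t) = \max_{0 \le f \le t} \{ \min\{ Y(n,t,f),\; n - f \} \}.$$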



Lemma 12.2.5 PROTOCOL D solves $\mathcal{KC}(k,t,\mathrm{WV1})$ in the Byzantine model for $k \ge Z(n,t)$.

Proof: We start by proving termination. At least one process among $p_1, \ldots, p_{t+1}$ is correct, and at
least $n - t$ processes receive its value and echo it. Hence it is guaranteed that each correct process receives at
least one set of $n - t$ identical echo messages and thus is able to decide.

Next we prove validity. Assume that there are no failures. Then all processes are correct and
thus the values accepted by any process are input values. All decisions are one of the accepted
values. Hence validity WV1 is satisfied.

Finally we prove agreement. We compute an upper bound on the number of different decisions
for each possible value of $f$, that is the number of actual failures. By definition, $0 \le f \le t$. We
distinguish two cases: (i) $n - t - f \le 0$ and (ii) $n - t - f > 0$. In case (i) a correct process may
be forced to communicate only with faulty processes. In this case we simply bound the number of
decisions with the number of correct processes, that is $n - f$. In case (ii) the total number of values
that correct processes accept from one faulty process is bounded by $\lfloor \frac{n-f}{n-t-f} \rfloor$. Indeed, a correct
process accepts a value when receiving at least $n - t$ echos, at least $n - t - f$ of which are from
correct processes. Thus the total number of values from $p_1, \ldots, p_{t+1}$ accepted by correct processes
is at most $(t + 1 - f) + f \lfloor \frac{n-f}{n-t-f} \rfloor$, that is the number of values sent by correct processes plus the
number of values that correct processes may be forced to accept because of the Byzantine behavior
of faulty processes. Hence the number of different decisions that we can have is $t + 1 - f + f \lfloor \frac{n-f}{n-t-f} \rfloor$.
It is possible that this bound is bigger than $n - f$. In such a case, we can bound the number of
different decisions by $n - f$. Summarizing the two cases we have that for any $f$, we bound the
number of decisions by $n - f$ if $n - t - f \le 0$ and by $\min\{t + 1 - f + f \lfloor \frac{n-f}{n-t-f} \rfloor,\; n - f\}$ if $n - t - f > 0$.
The maximum over all possible values of $f$ is given by $Z(n,t)$. Hence we have that the number of
decisions is always at most $Z(n,t)$, as required. □

We note that when $t < n/3$, $\lfloor \frac{n-f}{n-t-f} \rfloor = 1$ for all $0 \le f \le t$, and therefore, the protocol above
guarantees agreement for any $k > t$ (see Figure 12-1).
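The bound $Z(n,t)$ is easy to evaluate directly. The sketch below is an illustration only: it assumes the definitions read as $Y(n,t,f) = n - f$ when $n - t - f \le 0$ and $Y(n,t,f) = t + 1 - f + f\lfloor (n-f)/(n-t-f) \rfloor$ otherwise, with $Z(n,t)$ the maximum over $0 \le f \le t$ of $\min\{Y(n,t,f), n - f\}$.

```python
def Y(n, t, f):
    """Bound on accepted values when f of the t tolerated failures occur."""
    if n - t - f <= 0:
        return n - f
    return t + 1 - f + f * ((n - f) // (n - t - f))


def Z(n, t):
    """Worst case over 0 <= f <= t of min{Y(n, t, f), n - f}."""
    return max(min(Y(n, t, f), n - f) for f in range(t + 1))
```

Under this reading, evaluating `Z(n, t)` for parameters with $3t < n$ gives $t + 1$, which matches the remark that any $k > t$ suffices in that range.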

12.3 Remarks 

For the Byzantine model, the impossibility results and protocols we have provided in this section
leave a small gap for the $\mathcal{KC}$ problem defined with validities WV2, RV2 and SV2 and a substantial
gap for $\mathcal{KC}(\mathrm{WV1})$. An interesting open problem is to fill in this gap.

Chapter 13



Conclusions 



The $k$-set consensus problem is an abstraction of many coordination problems in a distributed system
that can suffer process failures. In this thesis we have investigated the $k$-set consensus problem in
asynchronous message passing distributed systems. We have extended previous work by exploring
several variations of the problem definition and model, including for the first time investigation of
Byzantine failures. We have shown that the precise definition of the validity requirement, which
characterizes what decision values are allowed as a function of the input values and whether failures
occur, is crucial to the solvability of the problem. For example, we have shown that allowing default
decisions in case of failures makes the problem solvable for most values of $k$ in the face of minority
failures, even in the face of the most severe type of failures (Byzantine). We have introduced six validity
conditions for this problem (all considered in various contexts in the literature), and demarcated the
line between possible and impossible for each case. In many cases this line is different from the one
of the originally studied $k$-set consensus problem.

In this thesis we have considered asynchronous systems. A natural question to ask is: what 
happens in synchronous systems? Clearly any algorithm that works in asynchronous systems works 
also in synchronous systems. 

Let us first consider the case of stop failures. The FloodSet algorithm (see for example [65, Ch.
7]) solves the $\mathcal{KC}$ problem in synchronous systems with stop failures. It tolerates any number of
failures, that is, it works for any $t < n$. The validity condition considered is validity RV1. Hence this
algorithm works also for validities RV2, WV1 and WV2. The impossibility proof for $\mathcal{KC}$ with validity SV1
that we have provided for asynchronous systems works also for synchronous systems. Hence there is
no $\mathcal{KC}$ protocol for validity SV1 in synchronous systems with stop failures. The above covers nearly
all the cases we have considered. The only open case is validity SV2: we can use the algorithm for
asynchronous systems that solves the problem for $t < n/4$ when $k = 2$ and for $t < n/3$ for $k \ge 3$. For
other cases we do not know.

For the Byzantine case there is no work on $\mathcal{KC}$ for synchronous systems. It is known that
$\mathcal{KC}(1,\mathrm{SV2})$ can be solved if and only if $t < n/3$ [78, 62]. The EIGbyz algorithm (see [65])
provides a solution to $\mathcal{KC}(1,\mathrm{SV2})$ when $t < n/3$. Lamport [60] proved that also $\mathcal{KC}(1,t,\mathrm{WV2})$ can be solved
if and only if $t < n/3$. Clearly EIGbyz solves also $\mathcal{KC}(1,\mathrm{WV2})$. No results for $\mathcal{KC}(k)$, $k \ge 2$, are
known for synchronous systems with Byzantine failures. Obviously one can use the algorithms
provided in this thesis for asynchronous systems. However this still leaves large gaps for the values of $k$
and $t$ for which we do not know if the problem is solvable or not.

Another natural question to ask is what happens in shared memory systems. Algorithms that
work for message passing systems work also for shared memory systems because a channel can be
simulated with shared memory. The FLP impossibility proof can be generalized to shared memory
[64]. The impossibility result of [20, 52, 84] works also for shared memory. In some of the
impossibility proofs we provided in this thesis we used the fact that the system is message-passing;
hence we do not know whether the impossibility results still hold in the shared memory setting.
We conjecture that the techniques used in this thesis can be used to provide a similar analysis for
the shared memory models.






Bibliography 



[1] ACM. Communications of the ACM 39(4), special issue on Group Communications Systems, April 
1996. 

[2] D. Agrawal and A. El Abbadi. An efficient and fault-tolerant solution for distributed mutual exclusion. 
ACM Transactions on Computer Systems, 9(1):1-20, 1991.

[3] Ehab S Al-Shaer, Hussein Abdel-Wahab, and Kurt Maly. HiFi: A new monitoring architecture for 
distributed system management. In 19th International Conference on Distributed Computing Systems 
(ICDCS), pages 171-178, June 1999. 

[4] Y. Amir, D. Breitgand, G. Chockler, and D. Dolev. Group communication as an infrastructure for dis- 
tributed system management. In 3rd International Workshop on Services in Distributed and Networked 
Environment (SDNE), pages 84-91, June 1996. 

[5] Y. Amir, G. V. Chockler, D. Dolev, and R. Vitenberg. Efficient state transfer in partitionable environments.
In 2nd European Research Seminar on Advances in Distributed Systems (ERSADS'97), pages
183-192. BROADCAST (ESPRIT WG 22455), Operating Systems Laboratory, Swiss Federal Institute 
of Technology, Lausanne, March 1997. Full version available as Technical Report CS98-12, Institute of 
Computer Science, The Hebrew University, Jerusalem, Israel. 

[6] Y. Amir, D. Dolev, S. Kramer, and D. Malki. Transis: A Communication Sub-System for High Availability.
In 22nd IEEE Fault-Tolerant Computing Symposium (FTCS), July 1992.

[7] Y. Amir, D. Dolev, P. Melliar-Smith, and L. Moser. Robust and efficient replication using group com- 
munication. Technical Report CS94-20, Institute of Computer Science, Hebrew University, Jerusalem, 
Israel, 1994. 

[8] Y. Amir and A. Wool. Optimal availability quorum systems: Theory and practice. Information Pro- 
cessing Letters, 65:223-228, 1998. 

[9] T. Anker, G. Chockler, I. Keidar, M. Rozman, and J. Wexler. Exploiting group communication for 
highly available video-on-demand services. In Proceedings of the IEEE 13th International Conference 
on Advanced Science and Technology (ICAST 97) and the 2nd International Conference on Multimedia 
Information Systems (ICMIS 97), pages 265-270, April 1997. 

[10] T. Anker, D. Dolev, and I. Keidar. Fault tolerant video-on-demand services. In 19th International 
Conference on Distributed Computing Systems (ICDCS), pages 244-252, June 1999. 






[11] H. Attiya. A direct proof of the asynchronous lower bound for k-set consensus. In Proceedings of 
the 17th ACM Symposium on Principle of Distributed Computing (PODC), page 314. Puerto Vallarta,
Mexico, 1998. 

[12] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message passing systems. Commu- 
nications of the ACM, 42(1):124-142, 1995. 

[13] H. Attiya, A. Bar-Noy, D. Dolev, and D. Peleg. Renaming in an asynchronous environment. Journal 
of the ACM, 37(3):524-548, July 1990. 

[14] O. Babaoglu, R. Davoli, L. Giachini, and M. Baker. Relacs: A communication infrastructure for 
constructing reliable applications in large-scale distributed systems. TR UBLCS94-15, Department of 
Computer Science, University of Bologna, 1994. 

[15] O. Babaoglu, R. Davoli, L. Giachini, and P. Sabattini. The inherent cost of strong-partial view syn- 
chronous communication. In Proceedings of Workshop on Distributed Algorithms on Graphs, pages 
72-86, 1995. 

[16] O. Babaoglu, R. Davoli, and A. Montresor. Failure detectors, group membership and view-synchronous 
communication in partitionable asynchronous systems. TR UBLCS95-18, Department of Computer 
Science, University of Bologna, 1995. 

[17] O. Babaoglu, R. Davoli, and A. Montresor. Partitionable Group Membership: Specification and Algo- 
rithms. TR UBLCS97-1, Department of Computer Science, University of Bologna, January 1997. 

[18] M. Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols. In 
Proceedings of the 2nd ACM Symposium on Principle of Distributed Computing (PODC), pages 27-30.
Montreal, Canada, 1983. 

[19] K.P. Birman and R. van Renesse. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer 
Society Press, Los Alamitos, CA, 1994. 

[20] E. Borowsky and E. Gafni. Generalized FLP impossibility result for t-resilient asynchronous computations.
In Proceedings of the 25th ACM Symposium on Theory of Computing (STOC), pages 91-100, 1993.

[21] G. Bracha. An o(n log n) expected rounds randomized byzantine generals algorithm. In Proceedings of 
the 4th ACM Symposium on Principle of Distributed Computing (PODC), 1985.

[22] G. Bracha and S. Toueg. Resilient consensus protocols. In Proceedings of the 2nd ACM Symposium on
Principle of Distributed Computing (PODC), pages 12-26, 1983. 

[23] T.D. Chandra, V. Hadzilacos, S. Toueg, and B. Charron-Bost. On the impossibility of group member- 
ship. In Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, 
pages 322-330, Philadelphia, Pennsylvania, May 1996. 

[24] S. Chaudhuri. Set consensus problems in totally asynchronous systems. Information and Computation, 
105(1), July 1993. 

[25] F. Cristian. Reaching agreement on processor group membership in synchronous distributed systems. 
Distributed Computing, 4(4), 1991. 






[26] F. Cristian. Group, majority and strict agreement in timed asynchronous distributed systems. In 
Proceedings of the 26th Conference on Fault-Tolerant Computer Systems, pages 178-187, 1996. 

[27] F. Cristian and F. Schmuck. Agreeing on processor group membership in asynchronous distributed 
systems. Technical Report CSE95-428, University of California-San Diego, La Jolla, CA 92093-0114, 
1995. 

[28] D. Davcev and W. Buckhard. Consistency and recovery control for replicated files. In ACM Symp. on 
Operating Systems Principles, volume 10, pages 87-96, 1985. 

[29] R. De Prisco. Revisiting the Paxos algorithm. Master's thesis, Department of Electrical Engineering 
and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, June 1997. Also 
MIT/LCS/TR-717. 

[30] R. De Prisco, B. Lampson, and N. Lynch. Revisiting the Paxos algorithm. In Proceedings of the 11th
Workshop on Distributed Algorithms (WDAG), volume 1320 of Lecture Notes in Computer Science,
pages 111-125. Springer-Verlag, 1997. Will appear in Theoretical Computer Science.

[31] D. Dolev, C. Dwork, and L. Stockmeyer. On the minimal synchrony needed for distributed consensus.
Journal of the ACM, 34(1):77-97, January 1987.

[32] D. Dolev, N. Lynch, S.S. Pinter, E.W. Stark, and W.E. Weihl. Reaching approximate agreement in the 
presence of faults. Journal of the ACM, 33(3):499-516, July 1986. 

[33] D. Dolev and D. Malkhi. The transis approach to high availability cluster communications. Communi- 
cations of the ACM, 39(4):64-70, 1996. 

[34] D. Dolev, D. Malkhi, and R. Strong. A framework for partitionable membership service. Technical 
Report TR95-4, Institute of Computer Science, Hebrew University, Jerusalem, Israel, March 1995. 

[35] C. Dwork, N. Lynch, and L. Stockmeyer. Consensus in the presence of partial synchrony. Journal of 
the ACM, 35(2):288-323, April 1988. 

[36] A. El Abbadi and S. Dani. A dynamic accessibility protocol for replicated databases. Data and knowledge 
engineering, 6:319-332, 1991. 

[37] A. El Abbadi and S. Toueg. Maintaining availability in partitioned replicated databases. ACM Trans- 
actions on Database Systems, 14(2):264-290, 1989. 

[38] P. D. Ezhilchelvan, A. Macedo, and S. K. Shrivastava. Newtop: a fault tolerant group communication 
protocol. In 15th International Conference on Distributed Computing Systems (ICDCS), June 1995. 

[39] A. Fekete. Asymptotically optimal algorithms for approximate agreement. Distributed Computing, 
4(1):9-29, March 1990.

[40] A. Fekete. Asynchronous approximate agreement. Information and Computation, 115(1):95-124, 
November 15 1994. 

[41] A. Fekete, N. Lynch, and A. A. Shvartsman. Specifying and using a partitionable group communication 
service. In Proceedings of the 16th ACM Symposium on Principle of Distributed Computing (PODC),
pages 53-62, Santa Barbara, CA, August 1997. 






[42] M.J. Fischer. The consensus problem in unreliable distributed systems (a brief survey). Research Report 
YALEU/DCS/RR-273, Yale University, Department of Computer Science, New Haven, CT 06520, June 
1983. 

[43] M.J. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. 
Journal of the ACM, 32(2):374-382, April 1985. 

[44] R. Friedman and R. van Renesse. Strong and weak virtual synchrony in Horus. Technical Report 
TR95-1537, Department of Computer Science, Cornell University, Ithaca, NY, 1995. 

[45] R. Friedman and A. Vaysburg. Fast replicated state machines over partitionable networks. In 16th 
IEEE International Symposium on Reliable Distributed Systems (SRDS), October 1997. 

[46] R. Friedman and A. Vaysburg. High-performance replicated distributed objects in partitionable envi- 
ronments. Technical Report 97-1639, Dept. of Computer Science, Cornell University, Ithaca, NY 14850, 
USA, July 1997. 

[47] D. Gifford. Weighted voting for replicated data. In Proceedings of the ACM Symposium on Operating 
Systems Principles, pages 150-162, 1979. 

[48] R. Guerraoui and A. Schiper. Transaction model vs virtual synchrony model: bridging the gap. In 
Theory and Practice in Distributed Systems, LNCS 938, pages 121-132. Springer-Verlag, September
1995. 

[49] Mark Hayden. Ensemble Reference Manual. Cornell University, 1996. 

[50] M. Herlihy. A quorum-consensus replication method for abstract data types. ACM Transactions on 
Computer Systems, 4(1):32-53, 1986.

[51] M. Herlihy. Dynamic quorum adjustment for partitioned data. ACM Transactions on Database Systems, 
12(2):170-194, June 1987. 

[52] M. Herlihy. The asynchronous computability theorem for t-resilient tasks. In Proceedings of the 25th
ACM Symposium on Theory of Computing (STOC), pages 111-120, 1993. 

[53] M. Hiltunen and R. Schlichting. Properties of membership services. In Proceedings of the 2nd Interna- 
tional Symposium on Autonomous Decentralized Systems, pages 200-207, 1995. 

[54] F. Jahanian, S. Fakhouri, and R. Rajkumar. Processor group membership protocols: Specification, de- 
sign and implementation. In Proceedings of the 12th IEEE Symposium on Reliable Distributed Systems, 
pages 2-11, 1993. 

[55] S. Jajodia and D. Mutchler. Dynamic voting algorithms for maintaining the consistency of a replicated 
database. ACM Trans. Database Systems, 15(2):230-280, 1990.

[56] I. Keidar. A Highly Available Paradigm for Consistent Object Replication. Master's thesis, Institute 
of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel, 1994. Also Institute of 
Computer Science, The Hebrew University of Jerusalem Technical Report CS95-5, and available from: 
http://www.cs.huji.ac.il/~transis/publications.html.

[57] I. Keidar and D. Dolev. Efficient message ordering in dynamic networks. In Proceedings of the 15th
ACM Symposium on Principle of Distributed Computing (PODC), pages 68-76, May 1996. 




[58] Roger Khazan. Group communication as a base for a load-balancing, replicated data service. Mas- 
ter's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of 
Technology, Cambridge, MA 02139, June 1998. 

[59] Roger Khazan, Alan Fekete, and Nancy Lynch. Multicast group communication as a base for a load- 
balancing replicated data service. In 12th International Symposium on Distributed Computing, pages 
258-272, Andros, Greece, September 1998. 

[60] L. Lamport. The weak byzantine generals problem. Journal of the ACM, 30(3):254-280, 1983. 

[61] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133-169, May 
1998. Also Research Report 49, DEC SRC, Palo Alto, CA, 1989. 

[62] L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Transactions on 
Programming Languages and Systems, 4(3):382-401, July 1982. 

[63] N. Lesley and A. Fekete. Providing virtual synchrony for group communication services. Preprint, 1997. 

[64] M. Loui and H. Abu-Amara. Memory requirements for agreement among unreliable asynchronous 
processes. In Parallel and Distributed Computing, pages 163-183. JAI Press, Greenwich CT, 1987. 
Volume of Advances in Computing Research. 

[65] N. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., San Mateo, CA, March 1996. 

[66] N. Lynch and A. A. Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged
broadcasts. In Proceedings of the 27th Annual International Symposium on Fault-Tolerant
Computing (FTCS), pages 272-281, Seattle, Washington, USA, June 1997. IEEE. 

[67] N. Lynch and M. R. Tuttle. An introduction to input/output automata. CWI-Quarterly, 2(3):219-246, 
September 1989. Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands. Technical 
Memo MIT/LCS/TM-373, Laboratory for Computer Science, Massachusetts Institute of Technology, 
Cambridge, MA 02139, November 1988. 

[68] D. Malkhi and M.K. Reiter. Byzantine quorum systems. Distributed Computing, 11:203-13, 1998. 

[69] D. Malkhi, M.K. Reiter, and A. Wool. The load and availability of byzantine quorum systems. In 
Proceedings of the 16th ACM Symposium on Principle of Distributed Computing (PODC), pages 249-
257, August 1997. 

[70] D. Malkhi, M.K. Reiter, and R. Wright. Probabilistic quorum systems. In Proceedings of the 16th ACM
Symposium on Principle of Distributed Computing (PODC), pages 267-273, August 1997. 

[71] C. Malloth and A. Schiper. View synchronous communication in large scale networks. In 2nd Open 
Workshop of the ESPRIT project BROADCAST (Number 6360), July 1995 (also available as a Technical 
Report Nr. 94/84 at Ecole Polytechnique Federale de Lausanne (Switzerland), October 1994). 

[72] L. Moser, Y. Amir, P. Melliar-Smith, and D. Agrawal. Extended virtual synchrony. In Proceedings of the 
14th IEEE International Conference on Distributed Computing Systems, pages 56-65, Poznan, Poland, 
June 1994. Full version appears in TR ECE93-22, Dept. of Electrical and Computer Engineering, 
University of California, Santa Barbara, CA. 






[73] L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, and C. A. Lingley-Papadopoulos.
Totem: A fault-tolerant multicast group communication system. Communications of the ACM, 39(4), 
April 1996. 

[74] M. Naor and A. Wool. The load, capacity and availability of quorum systems. SIAM Journal on 
Computing, 27(2):423-447, April 1998. 

[75] G. Neiger. A new look at membership services. In Proceedings of the 15th Annual ACM Symposium on 
Principles of Distributed Computing, pages 331-340, Philadelphia, Pennsylvania, May 1996. 

[76] B. Oki and B. Liskov. Viewstamped replication: A general primary copy method to support highly avail- 
able distributed systems. In Proceedings of the Seventh ACM Symposium on Principles of Distributed 
Computing, pages 8-17, Toronto, Ontario, Canada, August 1988. 

[77] J. Paris and D. Long. Efficient dynamic voting algorithms. In Proceedings of the 13th International
Conference on Very Large Data Base, pages 268-275, 1988. 

[78] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. Journal of the
ACM, 27(2):228-234, April 1980. 

[79] D. Peleg and A. Wool. The availability of quorum systems. Information and Computation, 123(2):210- 
223, 1995. 

[80] M. Rabin. Randomized byzantine generals. In Proceedings of the 15th ACM Symposium on Theory of
Computing (STOC), pages 403-409, 1983. 

[81] A. Ricciardi. The group membership problem in asynchronous systems. Technical Report TR92-1313, 
Department of Computer Science, Cornell University, Ithaca, NY, 1992. 

[82] A. Ricciardi and K.P. Birman. Using process groups to implement failure detection in asynchronous 
environments. In Proceedings of the 10th ACM Symposium on Principle of Distributed Computing
(PODC), pages 341-352, August 1991. 

[83] A. Ricciardi, A. Schiper, and K.P. Birman. Understanding partitions and the "no partitions" assump- 
tion. Technical Report TR93-1355, Department of Computer Science, Cornell University, Ithaca, NY, 
1993. 

[84] M. Saks and F. Zaharoglou. Wait-free k-set agreement is impossible: The topology of public knowledge.
In Proceedings of the 25th ACM Symposium on Theory of Computing (STOC), pages 101-110, 1993.

[85] A. Schiper and M. Raynal. From group communication to transactions in distributed systems. Com- 
munications of the ACM, 39(4):84-87, April 1996. 

[86] R. van Renesse, K.P. Birman, and S. Maffeis. Horus: A flexible group communication system. Com- 
munications of the ACM, 39(4):76-83, 1996. 

[87] R. Vitenberg, I. Keidar, G. V. Chockler, and D. Dolev. Group Communication Specifications: A 
Comprehensive Study. Technical report, Institute of Computer Science, The Hebrew University of 
Jerusalem, Jerusalem, Israel, September 1999. Also Technical Report MIT-LCS-TR-790, Massachusetts 
Institute of Technology, Laboratory for Computer Science and Technical Report CS0964, Computer 
Science Department, the Technion, Haifa, Israel. 




[88] E. Yeger Lotem, I. Keidar, and D. Dolev. Dynamic voting for consistent primary components. In 
Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, pages 
63-71, Santa Barbara, CA, August 1997. 

[89] E. Yeger Lotem, I. Keidar, and D. Dolev. Dynamic voting for consistent primary components. In 
Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, pages 
63-71, Santa Barbara, CA, August 1997. 






